INCORPORATING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR EMAIL SPAMMING


TAK-LAM WONG, KAI-ON CHOW, FRANZ WONG
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Abstract: Email spamming causes serious problems on the Internet, resulting in a huge waste of resources and attracting considerable attention from the research community. Automatic document classification and keyword-based filtering are two kinds of techniques that have been applied to filter spam emails with satisfactory results. This paper proposes a formal method for incorporating keyword-based filtering into document classification. To account for the potentially high cost of misclassifying a legitimate email as spam in real-world situations, a cost-sensitive evaluation metric is adopted to evaluate our approaches. We conducted extensive experiments on real-world data showing promising results.

Keywords: Email filtering; Document classification; Keyword-based filtering; Cost-sensitive evaluation

1. Introduction

The increasing amount of email spamming on the Internet has generated a large number of negative impacts [7,10]. For example, spamming can lead to a huge waste of processing power, transmission bandwidth, and storage space. Email spamming, which is the delivery of emails by unauthorized parties mostly for advertising purposes, is one of the most common forms of spamming. Figure 1 shows an excerpt of a spam email. A spam email normally contains advertising content such as a list of brand names. In 1997, about 10% of emails received by corporate networks were spam [6]. However, in 2003, Dunn reported that 75% of the emails in an account were spam [8]. This raises the need for effective measures to reduce the amount of spam.

The rationale of spam filtering is to identify spam emails and not let them be delivered to users. Document classification techniques, which aim at categorizing documents into different predefined categories, have been applied to spam filtering with satisfactory results. By making use of a machine learning approach, a document classifier is automatically trained from a set of training examples. One advantage of automatic document classification is that it can handle the uncertainty involved effectively. However, unlike ordinary document classification, in which the costs of misclassifying documents into different classes are the same, a serious problem in spam filtering is the high cost of classifying a ham email, which refers to an email sent by a known authority, as spam. For instance, a retailer may lose customers if the filtering method incorrectly classifies some emails from customers as spam and hence the retailer cannot provide an immediate response to those customers. On the contrary, classifying a spam email as ham may only cause browsing trouble to users, with relatively limited impact.

Figure 1. An excerpt of a spam email advertising replica brand-name watches, with a promotional link and contact details.

Keyword-based filtering is another approach to spam filtering with effective results and is commonly employed in commercial software. The idea is to maintain a list of spam words; emails that contain spam words are classified as spam. The advantage of this approach is that it is light-weight and effective. It also allows users to modify the list of spam words based on their prior knowledge. However, keyword-based filtering is considered ad hoc and is unable to handle uncertainty because it makes hard decisions in the filtering process. Software products that combine the two techniques are not commonly seen.

In this paper, we develop an approach to formally incorporate keyword-based filtering into different document classification techniques, in particular multinomial naïve Bayesian and K-nearest neighbour classifiers. Our approach can soften the hard decision made by keyword-based filtering, yet preserve its effectiveness. To tackle the different costs of misclassifying a ham email as spam and misclassifying a spam email as ham, a cost-sensitive method is used to evaluate the different filtering techniques. We have conducted experiments on real-world data to evaluate our approaches to spam filtering while considering the costs of the different types of misclassification. Empirical results demonstrate that our approaches are more effective than approaches using document classification alone.

2. Related Works

Several document classification based techniques have been proposed for spam filtering. Sahami et al. applied a Bayesian approach to filter spam emails [11]. Their approach considers a set of problem-specific features such as the domain type of the senders; for example, it is unlikely that spam emails are sent from .edu domains. Carreras and Marquez proposed a method using boosting techniques to improve the performance [4]. Their approach makes use of the AdaBoost algorithm to discover a set of weak rules and then combines several weak rules to form accurate rules for filtering spam emails. Androutsopoulos et al. investigated the effect of the size of the feature set, the size of the training set, the use of lemmatization, and the use of stop words in training naïve Bayesian classifiers for spam filtering [1]. Sakkis et al. employed a memory-based learning method, which corresponds to the K-nearest neighbour algorithm, for the task [12]. Different factors such as the value of K, the features adopted, etc. have been investigated in their work. A detailed comparison between naïve Bayesian and memory-based methods has been reported in [2]. Statistical techniques have also been employed to train document classifiers for spam filtering [3]. One characteristic of these document classification techniques is that they can effectively handle the uncertainty involved in the problem.

Besides document classification techniques, rule induction techniques have been proposed for filtering spam emails. Cohen proposed a method for learning filtering rules from a set of training examples [5]. The rules are keyword-spotting rules that classify emails based on keywords contained in different fields of the emails. Hidalgo et al. designed and developed a system with a set of heuristics for detecting spam emails [9]. Such heuristics include some special characters, for example # and !, contained in emails. These methods are mainly based on keyword matching in emails. An advantage of applying filtering rules to spam filtering is that they are fast and effective. However, the rules are quite restrictive and make only hard decisions in filtering.

3. Document Classification

3.1. Multinomial Naïve Bayesian

Figure 2. A generative model for generating emails (plate notation over the parameters α and β, the class c, and the N words w of each of the M emails; the shaded node w is observed).

Spam filtering can be considered a document classification problem, in which a document refers to an email and the classes include spam and ham.
An email can be considered as a bag of words. The generation process can be represented by the generative model depicted in Figure 2. This model represents the dependence among the different variables. Shaded and unshaded nodes correspond to observed and unobserved variables respectively, and a block refers to the repetition of the variables inside it. In the model, a multinomial parameter α specifies a multinomial distribution which generates the class c of an email. Together with another parameter β, c describes another multinomial distribution which generates the N words w contained in the email. If a collection contains M emails, the generation process repeats M times. Since the emails are given in spam filtering, w is an observed variable in the model. However, α, β, and c are unobservable because their values are hidden for an email.
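To make the generative story concrete, the following Python sketch samples emails from the model in Figure 2. It is our illustration, not the authors' code: the toy vocabulary, the class prior (standing in for α) and the per-class word distributions (standing in for β) are assumed values.

    import numpy as np

    # Toy illustration of the generative model in Figure 2 (all values assumed).
    rng = np.random.default_rng(0)
    vocab = ["free", "credit", "meeting", "report", "viagra"]   # a tiny vocabulary V
    classes = ["ham", "spam"]
    class_prior = np.array([0.3, 0.7])                  # multinomial P(c | α)
    word_dist = {                                       # multinomial P(w | c; β)
        "ham":  np.array([0.10, 0.10, 0.40, 0.35, 0.05]),
        "spam": np.array([0.35, 0.25, 0.05, 0.05, 0.30]),
    }

    def generate_email(n_words):
        """Sample a class from P(c|α), then n_words words from P(w|c;β)."""
        c = classes[rng.choice(len(classes), p=class_prior)]
        words = list(rng.choice(vocab, size=n_words, p=word_dist[c]))
        return c, words

    # Repeating the process M times yields a collection of M emails; only the
    # words are observed, while α, β and each email's class c stay hidden.
    for _ in range(3):
        print(generate_email(n_words=6))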

Based on the generative model, the probability of generating an email d_i given the parameters α and β is expressed as follows:

    P(d_i | α, β) = P(c | α) ∏_{k=1}^{N_i} P(w_k | c; β)                                   (1)

where w_k refers to the k-th word in the email and N_i denotes the number of words of the i-th email in the collection. Recall that our objective in spam filtering is to determine the class of a given email. Essentially, we determine the probability of the class of the document d_i being c_j given the document and the parameters, P(c = c_j | d_i; α, β). According to Bayes' theorem, we have:

    P(c = c_j | d_i; α, β) ∝ P(c = c_j | α) ∏_{k=1}^{N_i} P(w_k | c_j; β)                   (2)

Once we obtain the values of P(c | α) and P(w | c; β), we can infer the class label of an email. We determine these probabilities as follows. Let V be the vocabulary of words contained in all emails. Then we have:

    P(c = c_j | α) = (1 + Σ_{i=1}^{M} f(d_i, c_j)) / (M + |c|)                              (3)

where M is the number of emails in the collection; f(d_i, c_j) is an indicator function which equals 1 if document d_i is labeled as class c_j and 0 otherwise; and |c| denotes the number of classes in the underlying problem. In spam filtering |c| is equal to 2 because the classes are ham and spam. Next, we have:

    P(w_k | c_j; β) = (1 + Σ_{i=1}^{M} g(d_i, w_k, c_j)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{M} g(d_i, w_s, c_j))    (4)

where |V| denotes the size of the vocabulary and g(d_i, w_k, c_j) is an indicator function which equals 1 if d_i is labeled as c_j and contains the word w_k, and 0 otherwise.

3.2. K-Nearest Neighbour

K-nearest neighbour (K-NN) is a classification technique based on a similarity measure between instances and has been applied in document classification. In spam filtering, each email can be represented by a vector d_i. The t-th entry of d_i, denoted by w_it, refers to the weighting of the t-th word of the vocabulary V in the email. We define the similarity sim(d_p, d_q), known as the cosine similarity, between two emails d_p and d_q as follows:

    sim(d_p, d_q) = (d_p · d_q) / (‖d_p‖ ‖d_q‖)                                            (5)

where ‖d‖ refers to the norm of the vector d. Suppose we have a collection of emails denoted by D. To determine the class label of an email d_i, we first compute the similarity between d_i and each of the emails in D. Next, the K most similar emails are identified. The score of d_i belonging to class c_j is defined as follows:

    score(d_i, c_j) = Σ_{d_p ∈ KNN} sim(d_i, d_p) y(d_p, c_j)                               (6)

where KNN refers to the set of K-nearest neighbours in D, and y(d_p, c_j) is an indicator function which equals 1 if d_p is labeled as c_j and 0 otherwise. Finally, the class label of d_i is decided as follows:

    c = argmax_{c_j} score(d_i, c_j)                                                        (7)
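As a rough illustration of Equations (2)-(4), the following sketch trains and applies a multinomial naïve Bayesian filter using the paper's document-level counting (the indicator g counts documents, not word occurrences). The function names, the toy training emails, and the decision to ignore unseen words at classification time are our own assumptions.

    import math
    from collections import defaultdict

    def train_nb(docs, labels, classes=("ham", "spam")):
        """Estimate P(c|α) and P(w|c;β) following Eqs. (3) and (4).
        `docs` is a list of token lists, `labels` the corresponding class names."""
        M = len(docs)
        vocab = sorted({w for d in docs for w in d})
        V = len(vocab)
        # Eq. (3): Laplace-smoothed class prior.
        prior = {c: (1 + sum(1 for y in labels if y == c)) / (M + len(classes))
                 for c in classes}
        # Eq. (4): g(d_i, w_k, c_j) = 1 iff d_i is labelled c_j and contains w_k.
        contains = {c: defaultdict(int) for c in classes}
        for d, y in zip(docs, labels):
            for w in set(d):
                contains[y][w] += 1
        cond = {}
        for c in classes:
            denom = V + sum(contains[c][w] for w in vocab)
            cond[c] = {w: (1 + contains[c][w]) / denom for w in vocab}
        return prior, cond, set(vocab)

    def classify_nb(doc, prior, cond, vocab):
        """Eq. (2): pick the class maximising P(c|α) · ∏_k P(w_k|c;β) (in log space)."""
        scores = {}
        for c in prior:
            log_p = math.log(prior[c])
            for w in doc:
                if w in vocab:                 # words outside V are ignored here
                    log_p += math.log(cond[c][w])
            scores[c] = log_p
        return max(scores, key=scores.get)

    # Toy usage (labels and tokens are illustrative only).
    docs = [["cheap", "viagra", "offer"], ["meeting", "minutes", "report"],
            ["free", "credit", "offer"], ["project", "report", "deadline"]]
    labels = ["spam", "ham", "spam", "ham"]
    prior, cond, vocab = train_nb(docs, labels)
    print(classify_nb(["free", "viagra"], prior, cond, vocab))   # expected: spam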
4. Incorporating Keyword-Based Filtering in Different Classifiers

Keyword-based filtering is commonly applied in existing commercial spam filtering systems because it is fast and effective. The idea of keyword-based filtering is to check whether an email contains spam words; if it contains any, it is considered to be spam. Table 1 depicts a list of spam words, which is also used in the experiments in this paper. Though keyword-based filtering is effective, it is ad hoc, makes hard decisions, and is unlikely to handle uncertainty. As a consequence, we incorporate keyword-based filtering into multinomial naïve Bayesian and K-NN classifiers to improve the effectiveness and yet handle the uncertainty involved in spam filtering.

Table 1. A list of spam words for spam email filtering: gamble, gambling, casino, poker, pill, drug, medicine, viagra, sex, fuck, suck, porn, lonely girl, penis, nude, adult, lesbian, virus, 100%, earn money, credit card, cheap, cash, insurance, free, undeliverable mail.
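For comparison, the hard-decision keyword filter itself can be sketched in a few lines. This is our illustration; the grouping of some Table 1 entries into multi-word phrases ("earn money", "credit card", ...) and the use of lower-cased substring matching are assumptions, not details given in the paper.

    SPAM_WORDS = [
        "gamble", "gambling", "casino", "poker", "pill", "drug", "medicine",
        "viagra", "sex", "fuck", "suck", "porn", "lonely girl", "penis", "nude",
        "adult", "lesbian", "virus", "100%", "earn money", "credit card",
        "cheap", "cash", "insurance", "free", "undeliverable mail",
    ]

    def keyword_filter(email_text):
        """Hard decision: an email containing any spam word is labelled spam."""
        text = email_text.lower()
        return "spam" if any(w in text for w in SPAM_WORDS) else "ham"

    print(keyword_filter("Get cheap pills and earn money fast"))     # spam
    print(keyword_filter("Minutes of yesterday's project meeting"))  # ham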

Recall that a multinomial naïve Bayesian classifier is characterized by a set of probabilities P(c|α) and P(w|c;β). P(w|c;β) can be interpreted as how likely an email of class c is to contain the word w. For example, if a number of spam emails contain a word such as "viagra" while only a few ham emails contain this word, P(w=viagra|c=spam;β) will be significantly higher than P(w=viagra|c=ham;β). We incorporate keyword-based filtering into naïve Bayesian filtering based on this idea. For each spam word w' listed in Table 1, we set P(w=w'|c=spam;β) to a higher value and reduce the values of P(w|c=spam;β) for the other words accordingly, while we set P(w=w'|c=ham;β) to a smaller value and increase the values of P(w|c=ham;β) for the other words accordingly.

To incorporate keyword-based filtering into K-NN, we artificially create spam emails, each of which consists of only one spam word. Each document is represented by a vector in which each entry is the weight of a word in the vocabulary V, and we set the weight of the spam word in the artificial email to a higher value. These artificial documents are labeled as spam. To determine the class of an email d_i, the similarity between d_i and all the emails in D, as well as the similarity between d_i and the artificial spam emails, are calculated. The class label is then determined according to Equations (6) and (7). Consequently, an email that contains spam words is more likely to be classified as spam than by using the multinomial naïve Bayesian or K-NN classifiers alone. On the other hand, this approach softens the decision made by using keyword-based filtering only.
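A possible realization of the two adjustments described above is sketched below, reusing the prior/cond structures from the earlier naïve Bayes sketch. The boost and damping factors, the keyword weight given to the artificial emails, and all function names are our assumptions rather than values specified in the paper.

    import numpy as np

    def boost_keywords_nb(cond, spam_words, boost=5.0, damp=0.2):
        """Scale P(w|spam) up and P(w|ham) down for the listed spam words,
        then renormalize each class distribution so it still sums to one.
        `boost` and `damp` are illustrative factors, not values from the paper."""
        adjusted = {}
        for c, dist in cond.items():
            factor = boost if c == "spam" else damp
            raw = {w: (p * factor if w in spam_words else p) for w, p in dist.items()}
            total = sum(raw.values())
            adjusted[c] = {w: p / total for w, p in raw.items()}
        return adjusted

    def knn_with_keywords(train_vecs, train_labels, vocab_index, spam_words,
                          query_vec, k=5, keyword_weight=1.0):
        """K-NN scoring (Eqs. 5-7) augmented with one artificial spam email per
        spam word: a vector whose only non-zero entry is that word's weight."""
        vecs, labels = list(train_vecs), list(train_labels)
        for w in spam_words:
            if w in vocab_index:                      # multi-word phrases may be absent
                v = np.zeros(len(vocab_index))
                v[vocab_index[w]] = keyword_weight    # the assumed "higher" weight
                vecs.append(v)
                labels.append("spam")
        sims = [float(np.dot(query_vec, v) /
                      (np.linalg.norm(query_vec) * np.linalg.norm(v) + 1e-12))
                for v in vecs]                        # cosine similarity, Eq. (5)
        top_k = sorted(range(len(vecs)), key=lambda i: sims[i], reverse=True)[:k]
        score = {"ham": 0.0, "spam": 0.0}             # Eq. (6)
        for i in top_k:
            score[labels[i]] += sims[i]
        return max(score, key=score.get)              # Eq. (7)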
5. Experimental Results

We evaluate our proposed methods on the TREC 2005 Spam Corpus provided by the Text Retrieval Conference. It consists of emails of which about 57% are spam. As discussed in Section 1, about 70% of the emails to an account are spam. To simulate this real situation, we randomly select emails to construct the training set and emails to construct the testing set, such that both the training set and the testing set contain 70% spam emails and 30% ham emails. Each email is preprocessed by removing stop words and stemming each word. We also use the Chi-square method for feature selection, and the resulting feature set contains about 400 words. For K-NN, K is set to 5 and we adopt the commonly used term frequency-inverse document frequency (TF-IDF) weighting to calculate the weight of each word. Four sets of experiments were then conducted for evaluation. The first and second sets apply the multinomial naïve Bayesian and K-NN approaches respectively for filtering spam emails; these two sets can be treated as the baselines for our methods of incorporating keyword-based filtering. The third and fourth sets incorporate keyword-based filtering into multinomial naïve Bayesian and into K-NN respectively, as described in Section 4. The spam words are listed in Table 1.

We adopt the macro F-measure (F_macro) and micro F-measure (F_micro) as evaluation metrics in our experiments. However, one problem with the F-measure is that it does not consider the costs of the different types of misclassification. As exemplified in Section 1, the cost of classifying a ham email as spam is normally higher. As a result, we use another, cost-sensitive metric for our evaluation. Let λ be a parameter denoting the cost of classifying a ham email as spam relative to the cost of classifying a spam email as ham. For example, λ = 9 if classifying a ham email as spam is 9 times more expensive than classifying a spam email as ham. Next, we define the weighted error rate Err_λ as follows:

    Err_λ = (λ n_{ham→spam} + n_{spam→ham}) / (λ N_ham + N_spam)                            (8)

where n_{ham→spam} and n_{spam→ham} refer to the number of ham emails incorrectly classified as spam and the number of spam emails incorrectly classified as ham respectively, and N_ham and N_spam refer to the total number of ham emails and the total number of spam emails respectively. We also define the baseline error Err_λ^b as follows:

    Err_λ^b = N_spam / (λ N_ham + N_spam)                                                   (9)

Err_λ^b can be interpreted as the error for the case in which there is no filter and all the emails are considered ham. The total cost ratio (TCR_λ) is then a single value comparing a filter with this baseline:

    TCR_λ = Err_λ^b / Err_λ = N_spam / (λ n_{ham→spam} + n_{spam→ham})                      (10)

The higher the TCR_λ of an email filter, the better its performance. If TCR_λ is less than 1, the filter is worse than having no filter at all.
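Equations (8)-(10) reduce to a few lines of arithmetic. The sketch below uses made-up confusion counts, purely to show how λ reweights the two error types.

    def weighted_error(n_ham_to_spam, n_spam_to_ham, n_ham, n_spam, lam):
        """Eq. (8): weighted error rate Err_lambda."""
        return (lam * n_ham_to_spam + n_spam_to_ham) / (lam * n_ham + n_spam)

    def total_cost_ratio(n_ham_to_spam, n_spam_to_ham, n_ham, n_spam, lam):
        """Eqs. (9)-(10): TCR_lambda = Err_lambda^b / Err_lambda
           = N_spam / (lambda * n_ham_to_spam + n_spam_to_ham)."""
        return n_spam / (lam * n_ham_to_spam + n_spam_to_ham)

    # Made-up confusion counts, for illustration only.
    n_ham, n_spam = 300, 700
    n_ham_to_spam, n_spam_to_ham = 5, 60
    for lam in (1, 9):
        print(lam, total_cost_ratio(n_ham_to_spam, n_spam_to_ham, n_ham, n_spam, lam))
    # lam=1: 700/65  ≈ 10.77;  lam=9: 700/105 ≈ 6.67 — the same filter looks worse
    # once false positives (ham classified as spam) are penalized nine times more.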

Table 2. Experimental results of the different approaches to spam filtering evaluated using TCR_λ. Bayes and K-NN refer to the multinomial naïve Bayesian and K-nearest neighbour approaches respectively. Bayes++ and K-NN++ refer to the approaches that incorporate keyword-based filtering into multinomial naïve Bayesian and K-NN respectively.

Table 2 shows the experimental results of the different approaches for the cases λ = 1 and λ = 9. Recall that the higher the value of TCR_λ, the better the performance of the filter, and the higher the value of λ, the higher the cost of misclassifying a ham email as spam. It can be observed that when the cost of misclassifying a ham email as spam and that of misclassifying a spam email as ham are the same (i.e. λ = 1), the performance with keyword-based filtering incorporated is only comparable to the performance without it. However, when we set a higher cost for the misclassification of a ham email as spam, TCR_9 of K-NN++ is 0.14 higher than TCR_9 of K-NN, while TCR_9 of Bayes++ is 0.1 higher than TCR_9 of Bayes. Our approaches of incorporating keyword-based filtering outperform the approaches which do not incorporate it. Furthermore, it can be observed that multinomial naïve Bayesian is more sensitive to the cost of misclassifying a ham email as spam: when λ is increased from 1 to 9, TCR_λ changes from 8.89 to 3.21 and from 8.84 to 3.31 for the cases including and excluding keyword-based filtering respectively. On the other hand, K-NN is more stable with respect to the misclassification cost.

Table 3 shows the experimental results of the different approaches evaluated using the F-measure. Recall that the F-measure (F_macro and F_micro) assigns the same cost to the misclassification of a ham email as spam and the misclassification of a spam email as ham. It can be observed that the results of Bayes++ are similar to those of Bayes, while the results of K-NN are similar to those of K-NN++.

Table 3. Experimental results of the different approaches to spam filtering evaluated using the F-measure (F_macro and F_micro). Bayes, K-NN, Bayes++ and K-NN++ are defined as in Table 2.

6. Conclusions and Future Work

We propose approaches for formally incorporating keyword-based filtering, which is fast and effective but considered ad hoc, into document classification techniques, in particular multinomial naïve Bayesian and K-nearest neighbour. One property of our approaches is that they soften the hard decision made by keyword-based filtering. In order to evaluate the performance of our approaches effectively, we use a cost-sensitive evaluation method which considers different costs for the misclassification of a ham email as spam and the misclassification of a spam email as ham. Empirical results show that our approaches achieve better performance when a higher cost is set for misclassifying a ham email as spam. We intend to extend our work in several directions. One possible direction is to automatically discover the spam words from spam emails using machine learning approaches.
Methods such as latent semantic indexing have been applied to clustering words in documents of similar topics. It is likely that we can employ a similar technique to automatically discover spam words from spam emails. Another direction is to increase the expressiveness of the list of spam words. For example, the spam words may be organized as an ontology and provide richer information to the filtering system.

References

[1] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. Spyropoulos, An evaluation of naive Bayesian anti-spam filtering, Proceedings of the Workshop on Machine Learning in the New Information Age, pp. 9-17.

[2] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, and P. Stamatopoulos, Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach, Proceedings of the Fourth PKDD Workshop on Machine Learning and Textual Information Access, pp. 1-13.
[3] A. Bratko, G. Cormack, B. Filipic, T. Lynam, and B. Zupan, Spam filtering using statistical data compression models, Journal of Machine Learning Research, Vol. 7, No. 12.
[4] X. Carreras and L. Marquez, Boosting trees for anti-spam email filtering, Proceedings of the Fourth International Conference on Recent Advances in Natural Language Processing.
[5] W. Cohen, Learning rules that classify e-mail, Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access.
[6] L. F. Cranor and B. A. LaMacchia, Spam!, Communications of the ACM, Vol. 41, No. 8.
[7] B. Davison, M. Najork, and T. Converse, Proceedings of the Second International Workshop on Adversarial Information Retrieval on the Web, Technical Report LU-CSE, Department of Computer Science and Engineering, Lehigh University, 2006.
[8] J. E. Dunn, Spam e-mail: some figures on the threat, id=1372.
[9] J. Hidalgo, M. Lopez, and E. Sanz, Combining text and heuristics for cost-sensitive spam filtering, Proceedings of the Fourth Conference on Computational Language Learning and the Second Learning Language in Logic Workshop.
[10] G. Mishne, D. Carmel, and R. Lempel, Blocking blog spam with language model disagreement, Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, pp. 1-6.
[11] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, A Bayesian approach to filtering junk e-mail, Proceedings of the AAAI-98 Workshop on Learning for Text Categorization.
[12] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos, A memory-based approach to anti-spam filtering for mailing lists, Information Retrieval, Vol. 6, No. 1.
