INCORPORATING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR EMAIL SPAMMING
TAK-LAM WONG, KAI-ON CHOW, FRANZ WONG
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Abstract: Email spamming causes serious problems on the Internet, resulting in a huge waste of resources and attracting high attention from the research community. Automatic document classification and keyword-based filtering are two kinds of techniques which have been applied to filter spam emails with satisfactory results. This paper proposes a formal method for incorporating keyword-based filtering into document classification. To account for the potentially high cost of misclassifying a ham email as a spam email in real-world situations, a cost-sensitive evaluation metric is adopted to evaluate our approaches. We conducted extensive experiments on real-world data showing promising results.

Keywords: Spam filtering; Document classification; Keyword-based filtering; Cost-sensitive evaluation

1. Introduction

The increasing amount of spamming on the Internet has generated a large number of negative impacts [7,10]. For example, spamming can lead to a huge waste of processing power, transmission bandwidth, and storage space. Email spamming, which is the delivery of emails by unauthorized parties mostly for advertising purposes, is one of the most common forms of spamming. Figure 1 shows an excerpt of a spam email. A spam email normally contains advertising content such as a list of brand names. In 1997, about 10% of the emails received by corporate networks were spam [6]. However, in 2003, Dunn reported that 75% of the emails in an account were spam [8]. This raises the need for effective measures to reduce spam emails. The rationale of spam filtering is to identify spam emails and prevent them from being delivered to users. Document classification techniques, which aim at categorizing documents into different predefined categories, have been applied to spam filtering with satisfactory results.
By making use of a machine learning approach, a document classifier is automatically trained from a set of training examples. One advantage of automatic document classification is that it can handle the uncertainty involved effectively. However, unlike ordinary document classification, in which the costs of misclassifying documents into different classes are the same, a serious problem in spam filtering is the high cost of classifying a ham email, which refers to an email sent by a known authority, as spam. For instance, a retailer may lose customers if the filtering method incorrectly classifies some emails from customers as spam and hence the retailer cannot provide an immediate response to those customers. On the contrary, classifying a spam email as ham may only cause browsing trouble to users, with relatively limited impact.

Figure 1. An excerpt of a spam email:
  "Original replica Rolex and other watches for men and ladies from $ Follow this promotional link to see lowest prices. BMW Breguet... Vacheron Constantin Vip Rolex -- Phone: Mobile: troublemakeraffects@telesp.net.br"

Keyword-based filtering is another approach to spam filtering with effective results, and it is commonly employed in commercial software. The idea is to maintain a list of spam words; emails that contain spam words are classified as spam. The advantage of this approach is that it is light-weight and effective. Also, it allows users to modify the list of spam words based on their prior knowledge. However, keyword-based filtering is considered ad-hoc and is unable to handle uncertainty because it makes hard decisions in the filtering process. Software products that combine the two techniques are not commonly seen.

In this paper, we develop an approach to formally incorporate keyword-based filtering into different document classification techniques, in particular multinomial naïve Bayesian and K-nearest neighbour. Our approaches soften the hard decision made by keyword-based filtering, yet preserve its effectiveness. To tackle the different costs of misclassifying a ham email as spam and misclassifying a spam email as ham, a cost-sensitive method is used to evaluate the different filtering techniques. We have conducted experiments on real-world data to evaluate our approaches in tackling email spamming while considering the costs of the different types of misclassification. Empirical results demonstrate that our approaches are more effective than approaches using document classification alone.

2. Related Works

Several document classification based techniques have been proposed for spam filtering. Sahami et al. applied a Bayesian approach to filter spam emails [11]. Their approach considers a set of problem-specific features such as the domain type of the senders. For example, it is unlikely that spam emails are sent from .edu domains. Carreras and Marquez proposed a method using boosting techniques to improve the performance [4]. Their approach makes use of the AdaBoost algorithm to discover a set of weak rules and then combines several weak rules to form accurate rules for filtering spam emails. Androutsopoulos et al. investigated the effect of the size of the feature set, the size of the training set, the use of lemmatization, and the use of stop words in training naïve Bayesian classifiers for spam filtering [1]. Sakkis et al. employed a memory-based learning method, which corresponds to the K-nearest neighbour algorithm, for the task [12]. Different factors, such as the value of K and the features adopted, have been investigated in their work. A detailed comparison between naïve Bayesian and memory-based methods has been reported in [2]. Statistical techniques have also been employed to train document classifiers for spam filtering [3]. One characteristic of these document classification techniques is that they can effectively handle the uncertainty involved in the problem.

Besides document classification techniques, rule induction techniques have been proposed for filtering spam emails. Cohen proposed a method for learning filtering rules from a set of training examples [5]. The rules are keyword-spotting rules that classify emails based on keywords contained in different fields of the emails. Hidalgo et al. designed and developed a system with a set of heuristics for detecting spam emails [9]. Such heuristics include some special characters, for example # and !, contained in emails. These methods are mainly based on keyword matching in emails. An advantage of applying filtering rules to spam filtering is that they are fast and effective. However, the rules are quite restrictive and make only hard decisions in filtering.

3. Document Classification

3.1. Multinomial Naïve Bayesian

Figure 2. A generative model for generating emails (plate notation: a parameter α generates the class c of an email, which together with a parameter β generates its N words w; the process repeats for the M emails in a collection).

Spam filtering can be considered a document classification problem, in which a document refers to an email and the classes are spam email and ham email. An email can be considered a bag of words. The generation process can be represented by the generative model depicted in Figure 2. This model represents the dependence of the different variables. Shaded and unshaded nodes correspond to observed and unobserved variables respectively.
A block refers to the repetition of the variables inside it. In the model, a multinomial parameter α specifies a multinomial distribution which generates the class c of an email. Together with another parameter β, c describes another multinomial distribution which generates the N words w contained in the email. Suppose a collection contains M emails; the generation process then repeats M times. Since the emails are given in spam filtering, w is an observed variable in the model. However, α, β, and c are unobservable because their values are hidden for an email.

Based on the generative model, the probability of generating an email d_i given the parameters α and β is expressed as follows:

  P(d_i | α, β) = P(c | α) ∏_{k=1}^{N_i} P(w_k | c; β)    (1)

where w_k refers to the k-th word in the email and N_i denotes the number of words of the i-th email in the collection. Recall that our objective in spam filtering is to determine the class of a given email. Essentially, we determine the probability of the document d_i belonging to the class c_j given the document and the parameters, P(c = c_j | d_i; α, β). According to Bayes' theorem, we have:

  P(c = c_j | d_i; α, β) ∝ P(c = c_j | α) ∏_{k=1}^{N_i} P(w_k | c_j; β)    (2)

Once we obtain the values of P(c | α) and P(w | c; β), we can infer the class label of an email. We determine these probabilities as follows. Let V be the vocabulary of words contained in all emails. Then we have:

  P(c = c_j | α) = (1 + Σ_{i=1}^{M} f(d_i, c_j)) / (M + |C|)    (3)

where M is the number of emails in the collection; f(d_i, c_j) is an indicator function which equals 1 if document d_i is labeled as class c_j, and 0 otherwise; and |C| denotes the number of classes in the underlying problem. In spam filtering, |C| is equal to 2 because the classes are ham and spam. Next, we have:

  P(w_k | c_j; β) = (1 + Σ_{i=1}^{M} g(d_i, w_k, c_j)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{M} g(d_i, w_s, c_j))    (4)

where |V| denotes the size of the vocabulary and g(d_i, w_k, c_j) is an indicator function which equals 1 if d_i is labeled as c_j and contains the word w_k, and 0 otherwise.

3.2. K-Nearest Neighbour

K-nearest neighbour (K-NN) is a classification technique based on a similarity measure between instances and has been applied in document classification. In spam filtering, each email can be represented by a vector d_i. The t-th entry of d_i, denoted by w_it, refers to the weighting of the t-th word of the vocabulary V in the email.
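As an illustrative sketch only (not the authors' code; the function names and toy data are hypothetical), the smoothed estimates of Equations (3) and (4) and the classification rule of Equation (2) can be implemented roughly as follows. Note that Equation (4) counts a document at most once per word, i.e. g is a document-frequency indicator:

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab, classes=("spam", "ham")):
    """Estimate P(c) per Eq. (3) and P(w|c) per Eq. (4).

    docs: list of word lists; labels: class label per doc; vocab: word set."""
    M = len(docs)
    prior, cond = {}, {}
    for c in classes:
        class_docs = [set(d) for d, l in zip(docs, labels) if l == c]
        prior[c] = (1 + len(class_docs)) / (M + len(classes))         # Eq. (3)
        df = Counter(w for d in class_docs for w in d if w in vocab)  # g(.,w,c)
        denom = len(vocab) + sum(df[w] for w in vocab)
        cond[c] = {w: (1 + df[w]) / denom for w in vocab}             # Eq. (4)
    return prior, cond

def classify_nb(doc, prior, cond, vocab):
    """Pick the class maximising Eq. (2), using log probabilities."""
    def log_post(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][w]) for w in doc if w in vocab)
    return max(prior, key=log_post)
```

On a toy collection with two spam and two ham emails, `prior["spam"]` evaluates to (1+2)/(4+2) = 0.5, matching Equation (3).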
We define the similarity sim(d_p, d_q) between two emails d_p and d_q, known as the cosine similarity, as follows:

  sim(d_p, d_q) = (d_p · d_q) / (||d_p|| ||d_q||)    (5)

where ||d|| refers to the norm of the vector d. Suppose we have a collection of emails denoted by D. To determine the class label of an email d_i, we first compute the similarity between d_i and each of the emails in D. Next, the K most similar emails are identified. The score of d_i belonging to class c_j is defined as follows:

  score(d_i, c_j) = Σ_{d_p ∈ KNN} sim(d_i, d_p) y(d_p, c_j)    (6)

where KNN refers to the set of K nearest neighbours of d_i in D, and y(d_p, c_j) is an indicator function which equals 1 if d_p is labeled as c_j, and 0 otherwise. Finally, the class label of d_i is decided as follows:

  c = argmax_{c_j} score(d_i, c_j)    (7)

4. Incorporating Keyword-Based Filtering in Different Classifiers

Keyword-based filtering is commonly applied in existing commercial spam filtering systems because it is fast and effective. The idea of keyword-based filtering is to check whether an email contains spam words. If an email contains any spam words, it is considered to be spam. Table 1 depicts a list of spam words. This list is also used in the experiments in this paper. Though keyword-based filtering is effective, it is ad-hoc, makes hard decisions, and is unable to handle uncertainty. As a consequence, we incorporate keyword-based filtering in multinomial naïve Bayesian and K-NN classifiers to improve the effectiveness and yet handle the uncertainty involved in spam filtering.

Table 1. A list of spam words for spam email filtering:
  gamble, gambling, casino, poker, pill, drug, medicine, viagra, sex, fuck, suck, porn, lonely girl, penis, nude, adult, lesbian, virus, 100%, earn money, credit card, cheap, cash, insurance, free, undeliverable mail
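The plain keyword-based filter described above amounts to a single hard rule. A minimal sketch over a subset of the Table 1 list (the function name is hypothetical, and the naive substring matching is a simplification of what a real system would tokenize):

```python
# A subset of the spam words of Table 1 (illustrative only).
SPAM_WORDS = {
    "gamble", "gambling", "casino", "poker", "pill", "drug", "medicine",
    "viagra", "porn", "nude", "virus", "100%", "earn money", "credit card",
    "cheap", "cash", "insurance", "free", "undeliverable mail",
}

def keyword_filter(email_text):
    """Hard decision: spam iff the email contains any listed word or phrase."""
    text = email_text.lower()
    return "spam" if any(w in text for w in SPAM_WORDS) else "ham"
```

This makes the hard-decision problem concrete: a single ham email that happens to mention, say, "insurance" is irrevocably classified as spam, which is exactly the behaviour the incorporation schemes below are meant to soften.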
Recall that a multinomial naïve Bayesian classifier is characterized by a set of probabilities P(c | α) and P(w | c; β). P(w | c; β) can be interpreted as how likely an email of class c is to contain the word w. For example, if a number of spam emails contain a word such as "viagra" while only a few ham emails contain this word, P(w = viagra | c = spam; β) will be significantly higher than P(w = viagra | c = ham; β). We incorporate keyword-based filtering into naïve Bayesian filtering based on this idea. For each spam word w listed in Table 1, we set P(w | c = spam; β) to a higher value and reduce the values of P(w | c = spam; β) for the other words accordingly, while we set P(w | c = ham; β) to a smaller value and increase the values of P(w | c = ham; β) for the other words accordingly.

To incorporate keyword-based filtering in K-NN, we artificially create spam emails, each of which consists of only one spam word. Each document is represented by a vector in which each entry is the weight of a word in the vocabulary V. Next, we set the weight of the spam word in the artificial email to a higher value. These documents are labeled as spam. To determine the class of an email d_i, the similarity between d_i and all the emails in D, as well as the similarity between d_i and the artificial spam emails, are calculated. The class label is then determined according to Equations 6 and 7. Consequently, an email that contains spam words will be more likely to be classified as spam than by using a multinomial naïve Bayesian or K-NN classifier alone. On the other hand, this softens the decision made by using keyword-based filtering only.

5. Experimental Results

We evaluate our proposed methods on the TREC 2005 Spam Corpus provided by the Text Retrieval Conference. It consists of emails of which about 57% are spam. As discussed in Section 1, about 70% of the emails to an account are spam. To simulate the real situation, we randomly select emails to construct the training set and the testing set. Both the training set and the testing set contain 70% spam emails and 30% ham emails.
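The K-NN incorporation described in Section 4 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the names are hypothetical, raw word counts stand in for the TF-IDF weighting used in the experiments, and the artificial emails carry unit weight, since cosine similarity (Eq. 5) is scale-invariant for a single-entry vector, the boosted weight the paper assigns to it cancels in this simplified form:

```python
import math
from collections import Counter

def cosine(u, v):
    """Eq. (5): cosine similarity between two sparse word-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) \
         * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc, train, k=5, spam_words=()):
    """K-NN (Eqs. 5-7) with Section 4's artificial one-word spam emails.

    doc: list of words; train: list of (word-weight Counter, label) pairs.
    Each spam word adds one artificial spam-labelled neighbour."""
    pool = list(train) + [(Counter({w: 1.0}), "spam") for w in spam_words]
    vec = Counter(doc)
    neighbours = sorted(pool, key=lambda p: cosine(vec, p[0]), reverse=True)[:k]
    score = Counter()
    for d_p, label in neighbours:          # Eq. (6): similarity-weighted votes
        score[label] += cosine(vec, d_p)
    return score.most_common(1)[0][0]      # Eq. (7): highest-scoring class
```

An email containing a spam word is guaranteed a nonzero-similarity spam neighbour, yet strong ham neighbours can still outvote it, which is how the hard keyword decision is softened.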
Each email is preprocessed by removing stop words and stemming each word. We also use the Chi-square method for feature selection; the resulting feature set contains about 400 words. For K-NN, K is set to 5 and we adopt the commonly used term frequency-inverse document frequency (TF-IDF) weighting to calculate the weight of each word. After that, four sets of experiments were conducted for evaluation. The first and second sets of experiments apply the multinomial naïve Bayesian and K-NN approaches respectively for filtering spam emails. These two sets of experiments can be treated as the baselines for our methods of incorporating keyword-based filtering. The third and fourth sets incorporate keyword-based filtering into multinomial naïve Bayesian and K-NN respectively, as described in Section 4. The spam words used are listed in Table 1.

We adopt macro F-measure (F_macro) and micro F-measure (F_micro) as evaluation metrics in our experiments. However, one problem of the F-measure is that it does not consider the costs of different types of misclassification. As exemplified in Section 1, the cost of classifying a ham email as spam is normally relatively higher. As a result, we use another, cost-sensitive metric for our evaluation. Let λ be a parameter denoting the cost of classifying a ham email as spam relative to the cost of classifying a spam email as ham. For example, λ = 9 if classifying a ham email as spam is 9 times more expensive than classifying a spam email as ham. We then define the weighted error rate (Err_λ) as follows:

  Err_λ = (λ n_{ham→spam} + n_{spam→ham}) / (λ N_ham + N_spam)    (8)

where n_{ham→spam} and n_{spam→ham} refer to the number of ham emails incorrectly classified as spam and the number of spam emails incorrectly classified as ham respectively, and N_ham and N_spam refer to the total number of ham emails and spam emails respectively. We also define the baseline error (Err_λ^b) as follows:

  Err_λ^b = N_spam / (λ N_ham + N_spam)    (9)

Err_λ^b can be interpreted as the error for the case when there is no filter and all the emails are considered ham.
The total cost ratio (TCR_λ) is then a single value representing the comparison to the baseline:

  TCR_λ = Err_λ^b / Err_λ = N_spam / (λ n_{ham→spam} + n_{spam→ham})    (10)

The higher the TCR_λ of a spam filter, the better its performance. If TCR_λ is less than 1, the filter is worse than having no filter at all.
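Equations (8)-(10) reduce to two one-line functions. A minimal sketch with hypothetical names (note that Eq. (10)'s denominator is zero for a perfect filter, so a real implementation would guard that case):

```python
def weighted_error(n_ham_as_spam, n_spam_as_ham, n_ham, n_spam, lam=9):
    """Eq. (8): error rate with ham->spam mistakes counted lambda times."""
    return (lam * n_ham_as_spam + n_spam_as_ham) / (lam * n_ham + n_spam)

def total_cost_ratio(n_ham_as_spam, n_spam_as_ham, n_ham, n_spam, lam=9):
    """Eq. (10): baseline error (Eq. 9, no filter) over the filter's error.

    Values above 1 mean the filter beats the classify-everything-as-ham
    baseline; values below 1 mean it is worse than no filter."""
    return n_spam / (lam * n_ham_as_spam + n_spam_as_ham)
```

For example, a filter that makes 10 ham-to-spam and 50 spam-to-ham errors on 300 ham and 700 spam emails scores TCR_9 = 700 / (9·10 + 50) = 5.0, while classifying everything as ham always gives TCR_λ = 1.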
Table 2. The experimental results of different approaches to spam filtering evaluated using TCR_λ. Bayes and K-NN refer to the multinomial naïve Bayesian and K-nearest neighbour approaches respectively. Bayes++ and K-NN++ refer to the approaches incorporating keyword-based filtering into multinomial naïve Bayesian and K-NN respectively.

           Bayes    K-NN    Bayes++    K-NN++
  TCR_1
  TCR_9

Table 2 shows the experimental results of the different approaches for the cases λ = 1 and λ = 9 respectively. Bayes and K-NN refer to the multinomial naïve Bayesian and K-nearest neighbour approaches respectively. Bayes++ and K-NN++ refer to the approaches that incorporate keyword-based filtering into multinomial naïve Bayesian and K-NN respectively. Recall that the higher the value of TCR_λ, the better the performance of the filter, and the higher the value of λ, the higher the cost of misclassifying a ham email as spam. It can be observed that when the cost of misclassifying a ham email as spam and that of misclassifying a spam email as ham are the same (i.e. λ = 1), the performance with keyword-based filtering incorporated is merely comparable to the performance without it. However, when we set a higher cost for misclassifying a ham email as spam, TCR_9 of K-NN++ is 0.14 higher than TCR_9 of K-NN, while TCR_9 of Bayes++ is 0.1 higher than TCR_9 of Bayes. Our approaches incorporating keyword-based filtering outperform the approaches which do not. Furthermore, it can be observed that multinomial naïve Bayesian is more sensitive to the cost of misclassifying a ham email as spam: when λ is increased from 1 to 9, the value of TCR_λ changes from 8.89 to 3.21 for Bayes and from 8.84 to 3.31 for Bayes++. K-NN, on the other hand, is more stable with respect to the misclassification cost. Table 3 shows the experimental results of the different approaches evaluated using the F-measure.
Recall that the F-measure (F_macro and F_micro) assigns the same cost to misclassifying a ham email as spam and misclassifying a spam email as ham. It can be observed that the results of Bayes++ are similar to those of Bayes, while the results of K-NN++ are similar to those of K-NN.

Table 3. The experimental results of different approaches to spam filtering evaluated using the F-measure. Bayes and K-NN refer to the multinomial naïve Bayesian and K-nearest neighbour approaches respectively. Bayes++ and K-NN++ refer to the approaches incorporating keyword-based filtering into multinomial naïve Bayesian and K-NN respectively.

           Bayes    K-NN    Bayes++    K-NN++
  F_macro
  F_micro

6. Conclusions and Future Work

We propose approaches for formally incorporating keyword-based filtering, which is fast and effective but considered ad-hoc, into document classification techniques, in particular multinomial naïve Bayesian and K-nearest neighbour. One property of our approaches is that they soften the hard decision made by keyword-based filtering. To evaluate the performance of our approaches effectively, we use a cost-sensitive evaluation method which assigns different costs to misclassifying a ham email as spam and misclassifying a spam email as ham. Empirical results show that our approaches achieve better performance when a higher cost is set for misclassifying a ham email as spam.

We intend to extend our work in several directions. One possible direction is to automatically discover spam words from emails using machine learning approaches. Methods such as latent semantic indexing have been applied to clustering words in documents of similar topics. It is likely that we can employ similar techniques to automatically discover spam words from spam emails. Another direction is to increase the expressiveness of the list of spam words. For example, the spam words may be organized as an ontology, providing richer information to the filtering system.

References

[1] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. Spyropoulos, "An evaluation of naive Bayesian anti-spam filtering," Proceedings of the Workshop on Machine Learning in the New Information Age, pp. 9-17.
[2] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach," Proceedings of the Fourth PKDD Workshop on Machine Learning and Textual Information Access, pp. 1-13.
[3] A. Bratko, G. Cormack, B. Filipic, T. Lynam, and B. Zupan, "Spam filtering using statistical data compression models," Journal of Machine Learning Research, Vol. 7, No. 12.
[4] X. Carreras and L. Marquez, "Boosting trees for anti-spam email filtering," Proceedings of the Fourth International Conference on Recent Advances in Natural Language Processing.
[5] W. Cohen, "Learning rules that classify e-mail," Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access.
[6] L. F. Cranor and B. A. LaMacchia, "Spam!," Communications of the ACM, Vol. 41, No. 8.
[7] B. Davison, M. Najork, and T. Converse, Proceedings of the Second International Workshop on Adversarial Information Retrieval on the Web, Technical Report LU-CSE, Department of Computer Science and Engineering, Lehigh University, 2006.
[8] J. E. Dunn, "Spam: some figures on the threat," id=1372.
[9] J. Hidalgo, M. Lopez, and E. Sanz, "Combining text and heuristics for cost-sensitive spam filtering," Proceedings of the Fourth Conference on Computational Language Learning and the Second Learning Language in Logic Workshop.
[10] G. Mishne, D. Carmel, and R. Lempel, "Blocking blog spam with language model disagreement," Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, pp. 1-6.
[11] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," Proceedings of the AAAI-98 Workshop on Learning for Text Categorization.
[12] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos, "A memory-based approach to anti-spam filtering for mailing lists," Information Retrieval, Vol. 6, No. 1.
More informationContent Based Spam Filtering
2016 International Conference on Collaboration Technologies and Systems Content Based Spam E-mail Filtering 2nd Author Pingchuan Liu and Teng-Sheng Moh Department of Computer Science San Jose State University
More informationSNS College of Technology, Coimbatore, India
Support Vector Machine: An efficient classifier for Method Level Bug Prediction using Information Gain 1 M.Vaijayanthi and 2 M. Nithya, 1,2 Assistant Professor, Department of Computer Science and Engineering,
More informationSimilarity search in multimedia databases
Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationSchematizing a Global SPAM Indicative Probability
Schematizing a Global SPAM Indicative Probability NIKOLAOS KORFIATIS MARIOS POULOS SOZON PAPAVLASSOPOULOS Department of Management Science and Technology Athens University of Economics and Business Athens,
More informationarxiv: v1 [cs.lg] 12 Feb 2018
Email Classification into Relevant Category Using Neural Networks arxiv:1802.03971v1 [cs.lg] 12 Feb 2018 Deepak Kumar Gupta & Shruti Goyal Co-Founders: Reckon Analytics deepak@reckonanalytics.com & shruti@reckonanalytics.com
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationGenerating Estimates of Classification Confidence for a Case-Based Spam Filter
Generating Estimates of Classification Confidence for a Case-Based Spam Filter Sarah Jane Delany 1, Pádraig Cunningham 2, and Dónal Doyle 2 1 Dublin Institute of Technology, Kevin Street, Dublin 8, Ireland
More informationAutomatic Domain Partitioning for Multi-Domain Learning
Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels
More informationProject Report: "Bayesian Spam Filter"
Humboldt-Universität zu Berlin Lehrstuhl für Maschinelles Lernen Sommersemester 2016 Maschinelles Lernen 1 Project Report: "Bayesian E-Mail Spam Filter" The Bayesians Sabine Bertram, Carolina Gumuljo,
More informationImproving Recognition through Object Sub-categorization
Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,
More informationChapter-8. Conclusion and Future Scope
Chapter-8 Conclusion and Future Scope This thesis has addressed the problem of Spam E-mails. In this work a Framework has been proposed. The proposed framework consists of the three pillars which are Legislative
More informationDetecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA
More informationKarami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.
Online Review Spam Detection by New Linguistic Features Amir Karam, University of Maryland Baltimore County Bin Zhou, University of Maryland Baltimore County Karami, A., Zhou, B. (2015). Online Review
More informationClassification and Clustering
Chapter 9 Classification and Clustering Classification and Clustering Classification/clustering are classical pattern recognition/ machine learning problems Classification, also referred to as categorization
More informationMODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS
MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500
More information2. Design Methodology
Content-aware Email Multiclass Classification Categorize Emails According to Senders Liwei Wang, Li Du s Abstract People nowadays are overwhelmed by tons of coming emails everyday at work or in their daily
More informationWEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS
WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS Juan Martinez-Romo and Lourdes Araujo Natural Language Processing and Information Retrieval Group at UNED * nlp.uned.es Fifth International Workshop
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationCollaborative Spam Mail Filtering Model Design
I.J. Education and Management Engineering, 2013, 2, 66-71 Published Online February 2013 in MECS (http://www.mecs-press.net) DOI: 10.5815/ijeme.2013.02.11 Available online at http://www.mecs-press.net/ijeme
More informationAn Improvement of Centroid-Based Classification Algorithm for Text Classification
An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,
More informationMulti-Dimensional Text Classification
Multi-Dimensional Text Classification Thanaruk THEERAMUNKONG IT Program, SIIT, Thammasat University P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani, Thailand, 12121 ping@siit.tu.ac.th Verayuth LERTNATTEE
More informationBayesTH-MCRDR Algorithm for Automatic Classification of Web Document
BayesTH-MCRDR Algorithm for Automatic Classification of Web Document Woo-Chul Cho and Debbie Richards Department of Computing, Macquarie University, Sydney, NSW 2109, Australia {wccho, richards}@ics.mq.edu.au
More informationVALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER
VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018
More informationFinal Report - Smart and Fast Sorting
Final Report - Smart and Fast Email Sorting Antonin Bas - Clement Mennesson 1 Project s Description Some people receive hundreds of emails a week and sorting all of them into different categories (e.g.
More informationDeveloping Focused Crawlers for Genre Specific Search Engines
Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com
More informationBayesian Spam Detection
Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal Volume 2 Issue 1 Article 2 2015 Bayesian Spam Detection Jeremy J. Eberhardt University or Minnesota, Morris Follow this and additional
More informationSemantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman
Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information
More informationInformation-Theoretic Feature Selection Algorithms for Text Classification
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute
More informationBayesian Spam Filtering Using Statistical Data Compression
Global Journal of researches in engineering Numerical Methods Volume 11 Issue 7 Version 1.0 December 2011 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc.
More informationMODULE 7 Nearest Neighbour Classifier and its variants LESSON 11. Nearest Neighbour Classifier. Keywords: K Neighbours, Weighted, Nearest Neighbour
MODULE 7 Nearest Neighbour Classifier and its variants LESSON 11 Nearest Neighbour Classifier Keywords: K Neighbours, Weighted, Nearest Neighbour 1 Nearest neighbour classifiers This is amongst the simplest
More informationAn Assessment of Case Base Reasoning for Short Text Message Classification
An Assessment of Case Base Reasoning for Short Text Message Classification Matt Healy 1, Sarah Jane Delany 1, and Anton Zamolotskikh 2 1 Dublin Institute of Technology, Kevin Street, Dublin 8, Ireland
More informationProbabilistic Anti-Spam Filtering with Dimensionality Reduction
Probabilistic Anti-Spam Filtering with Dimensionality Reduction ABSTRACT One of the biggest problems of e-mail communication is the massive spam message delivery Everyday billion of unwanted messages are
More informationText Classification. Dr. Johan Hagelbäck.
Text Classification Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Document Classification A very common machine learning problem is to classify a document based on its text contents We use
More informationCombining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines
Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines SemDeep-4, Oct. 2018 Gengchen Mai Krzysztof Janowicz Bo Yan STKO Lab, University of California, Santa Barbara
More informationCS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai
CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until
More informationFeature selection. LING 572 Fei Xia
Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection
More informationSupervised classification of law area in the legal domain
AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationCost-sensitive Boosting for Concept Drift
Cost-sensitive Boosting for Concept Drift Ashok Venkatesan, Narayanan C. Krishnan, Sethuraman Panchanathan Center for Cognitive Ubiquitous Computing, School of Computing, Informatics and Decision Systems
More informationA Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion
A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham
More informationImproving the methods of classification based on words ontology
www.ijcsi.org 262 Improving the methods of email classification based on words ontology Foruzan Kiamarzpour 1, Rouhollah Dianat 2, Mohammad bahrani 3, Mehdi Sadeghzadeh 4 1 Department of Computer Engineering,
More informationAnnotated Suffix Trees for Text Clustering
Annotated Suffix Trees for Text Clustering Ekaterina Chernyak and Dmitry Ilvovsky National Research University Higher School of Economics Moscow, Russia echernyak,dilvovsky@hse.ru Abstract. In this paper
More informationCS 188: Artificial Intelligence Fall Machine Learning
CS 188: Artificial Intelligence Fall 2007 Lecture 23: Naïve Bayes 11/15/2007 Dan Klein UC Berkeley Machine Learning Up till now: how to reason or make decisions using a model Machine learning: how to select
More informationCS294-1 Final Project. Algorithms Comparison
CS294-1 Final Project Algorithms Comparison Deep Learning Neural Network AdaBoost Random Forest Prepared By: Shuang Bi (24094630) Wenchang Zhang (24094623) 2013-05-15 1 INTRODUCTION In this project, we
More informationTo earn the extra credit, one of the following has to hold true. Please circle and sign.
CS 188 Spring 2011 Introduction to Artificial Intelligence Practice Final Exam To earn the extra credit, one of the following has to hold true. Please circle and sign. A I spent 3 or more hours on the
More informationFeature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News
Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung
More informationHigh Reliability Text Categorisation Systems
University of Cagliari Department of Electrical and Electronic Engineering High Reliability Text Categorisation Systems Doctoral Thesis of: Dott. Ing. Ignazio Pillai Tutor: Prof. Ing. Fabio Roli Dottorato
More informationStudy on Classifiers using Genetic Algorithm and Class based Rules Generation
2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules
More informationCombination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset
International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationDomain Specific Search Engine for Students
Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationClassifying Bug Reports to Bugs and Other Requests Using Topic Modeling
Classifying Bug Reports to Bugs and Other Requests Using Topic Modeling Natthakul Pingclasai Department of Computer Engineering Kasetsart University Bangkok, Thailand Email: b5310547207@ku.ac.th Hideaki
More informationRobust Relevance-Based Language Models
Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new
More information