CHAPTER 4 CONTENT BASED FILTERING


4.1 INTRODUCTION

Many anti-spam techniques have been proposed by researchers to combat spam, but no method provides a successful solution to reduce false positive and false negative rates. The two most important methods available for combating spam are blocking the sources of spam and filtering spam according to content. Content based filtering is often applied to generate automatic filtering rules and to classify through machine learning approaches, such as NB, SVM, K-NN and ANN. These approaches usually analyze the occurrence and distribution of words and phrases in the content of e-mails and use the result to filter incoming e-mails.

4.2 CONTENT BASED SPAM FILTERING TECHNIQUES

Content based spam-filtering techniques are largely based on machine learning filters and cover a wide range of techniques.

Machine Learning Filters

The use of the Bayes formula as a tool to identify spam was initially applied to spam filtering by Sahami et al (1998) and Pantel and Lin (1998). Graham (2002, 2003) later implemented a Bayesian filter that caught 99.5% of spam with 0.03% false positives. Androutsopoulos et al (2000) established that a NB filter clearly surpasses keyword-based filtering, even with a very

small training corpus. Zdziarski (2004) introduced Bayesian noise reduction to remove irrelevant text as a way of increasing the quality of the data provided to a NB classifier. Further filters generally use NB filtering, which assumes that the occurrence of events is independent of each other; i.e. such filters do not consider that the keywords "special" and "offers" are more likely to appear together in spam than in legitimate e-mail. SVMs are generated by mapping training data in a nonlinear manner to a higher-dimensional feature space, where a hyperplane is constructed which maximizes the margin between the sets. Drucker et al (1999) applied the technique to spam filtering and tested it against text classification algorithms such as Ripper, Rocchio and boosting decision trees. Both boosting trees and SVMs provided acceptable performance, but SVMs performed with lower training requirements. SVM and K-NN approaches are normally susceptible to noise, meaning that errors in the training set can easily induce misclassification (Joachims 1998). They also tend to be computationally intensive on larger datasets. Clark et al (2003) construct a backpropagation-trained ANN classifier named LINGER. Chhabra et al (2004) present a spam classifier based on a Markov Random Field (MRF) model. The inter-word dependence of natural language, which is normally ignored by naive Bayesian classifiers, can therefore be incorporated into the classification process.

Previous Likeness Based Filters

Memory-based, or instance-based, machine learning techniques classify incoming e-mails according to their similarity to stored examples (i.e. training e-mails). Defined attributes form a multi-dimensional space, where new instances are plotted as points. A new instance is then assigned to the majority class of its K-closest training instances (Trudgian 2004), using the

K-NN algorithm, which classifies the e-mails. Sakkis et al (2000, 2001) used a K-NN spam classifier, compared it with a naïve Bayesian classifier using cost-sensitive evaluation and obtained favorable results.

Case Based Reasoning Filters

Case-Based Reasoning (CBR) systems maintain their knowledge in a collection of previously classified cases, rather than in a set of rules. An incoming e-mail is matched against similar cases in the system's collection, which provide guidance towards the correct classification of the e-mail. The final classification, along with the e-mail itself, then forms part of the system's collection for the classification of future e-mails. Cunningham et al (2003) construct a case-based reasoning classifier that can track concept drift. They propose that the classifier both adds new cases and removes old cases from the system collection, allowing the system to adapt to the drift of characteristics in both spam and legitimate mails. An initial evaluation of the classifier suggests that it outperforms naive Bayesian classification.

Complementary Filters

Adaptive spam filtering (Pelletier et al 2004) targets spam by category. It divides an e-mail corpus into several categories, each with a representative text. An incoming e-mail is then compared with each category, and a resemblance ratio is generated to determine the likely class of the e-mail. Boykin and Roychowdhury (2005) identify a user's trusted network of correspondents with an automated graph method to distinguish between legitimate and spam e-mails. The authors intend this filter to be a part of a more comprehensive filtering system, with a content-based filter responsible for classifying the remaining messages.

Ontology Based Filters

The creation of an adaptive ontology which helps in classifying e-mails is discussed by Youn and McLeod et al (2000). The motivation of this approach is that it opens up a whole new aspect of e-mail classification on the semantic web. The classification accuracy can be improved initially by pruning the ontology tree and using better classification algorithms. The challenge was mainly to convert J48 classification (the open source Java implementation of C4.5) outputs to RDF and feed them into Jena, i.e. interfacing two independent systems and creating a prototype that actually uses the information flowing from one system to another to obtain the desired output. Its limitations include that it expects the input in a particular Comma Separated Values (CSV) format. Christian Hempelmann and Vikas Mehra (2008) propose OST (Ontological Semantic Technology) for semantic spam filtering by category (Spam/Ham); it produces a Text Meaning Representation (TMR) using word senses from a lexicon which are based on concepts and their properties captured in an ontology. The approach is computationally intensive and requires more comparisons. Balakumar and Vydehi (2010) proposed a method to create an e-mail classification filter that satisfies the user's preferred classification and avoids a time-consuming process. It uses ontology for understanding the content of the e-mail and a Bayesian approach for making the classification, and may suffer from limitations similar to those of Bayesian classifiers.

4.3 EFFECTIVENESS OF MACHINE LEARNING FILTERS

Machine learning variants can normally achieve effectiveness with less manual intervention and are more adaptive to continued changes in spam patterns. Further, they do not depend on any predefined rule sets, as is usual

with their non-machine-learning counterparts. The overall utility of a classifier directly depends on the training set (Weiss and Tian 2006). The performance of numerous machine-learning approaches is also largely dependent on the size of the training set and on training updates. A larger training set produces better results, although more processing time is normally required for the learning process. Furthermore, the identification and selection of the best feature set introduces further challenges.

4.4 CONTENT ANALYZER METHOD

The widespread use of machine learning filters and the application of fuzzy logic methods have led to various improvements in accuracy and precision rate. The existing approaches suffer from a high number of false positives, and the important task of identifying spoofed mails has not been achieved. Vulnerabilities of machine learning filters, such as skipping spam keywords, replacing spam keywords with similar meanings that are not in the spam database, and word obfuscation, can be misused by spammers to reach users' Inbox. Combined with this, legitimate identities/trusted senders' identities can be used to send spoofed e-mails that draw the attention of users, who then believe the content displayed, click any malicious URLs present in the e-mail and reply with the requested user credentials. Hence, the method proposed here is tested with different spam filters, including a fuzzy logic without semantic analyzer filter, a fuzzy logic with semantic analyzer filter and a user preference ontology filter, which reduce the false positive rate and false negative rate, improve accuracy and identify white listed sender identities being misused to send spam e-mails. All these methods commonly require e-mail extraction and stop word elimination using the Porter stemming algorithm to obtain keywords. The system architecture for the Content Analyzer Method is illustrated in Figure 4.1.

Figure 4.1 System architecture for content analyzer method

Extraction and Stop Word Elimination

The extraction and stop word elimination module, as in Figure 4.2, processes incoming mails to retrieve Fromaddr and Body using the Text Extraction Algorithm, and their contents are parsed. The parsed words are given as input to the stemming algorithm, after which the stop words are eliminated.

Figure 4.2 Sequence diagram for extraction and stop word elimination
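As a rough illustration of this module, the following Python sketch tokenizes a message body, drops stop words and applies a toy suffix stripper standing in for the full Porter stemmer. The stop word list and stemming rules here are illustrative assumptions, not those of the actual system:

```python
import re

# Illustrative stop word list; a real filter would use a much fuller set.
STOP_WORDS = {"a", "an", "and", "are", "the", "for", "of", "to", "in", "is"}

def light_stem(word):
    """Very small suffix stripper standing in for the Porter algorithm."""
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_keywords(body):
    """Tokenize a mail body, eliminate stop words, stem the remainder."""
    tokens = re.findall(r"[a-z]+", body.lower())
    return [light_stem(t) for t in tokens if t not in STOP_WORDS]

print(extract_keywords("The offers are amazing and waiting for you"))
# → ['offer', 'amaz', 'wait', 'you']
```

The over-aggressive stems (e.g. "amaz") are typical of suffix stripping; classification only needs the different forms of a word to collapse to the same attribute.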

Naïve Bayes Filtering - Bayesian Approach

Bayesian filtering has become the universally accepted approach of enterprise-scale filtering solutions. It uses an adaptive rule set such that the tokens and their associated probabilities are manipulated according to the user's classification decisions and the types of e-mails received. In this technique, each message is described by a set of attributes (e.g. words or phrases). Probabilities are assigned to each attribute based on its number of occurrences in the training corpus.

Bayes Theorem: Bayes theorem provides a way to calculate the probability of a hypothesis, for the event Y, given the observed training data, represented as X:

P(Y|X) = P(X|Y) P(Y) / P(X) (4.1)

It is often easier to calculate the probabilities P(X|Y), P(X) and P(Y) than the probability P(Y|X) that is required. The filter consists of four major modules, each responsible for a different process: message tokenization, probability estimation, feature selection and Naive Bayesian classification, as shown in Figure 4.3. When a message arrives, it is tokenized into a set of features (tokens), F. Every feature is assigned an estimated probability that indicates its spam nature. In order to reduce the dimensionality of the feature vector, a feature selection algorithm is applied to output a subset of the features, F1 ⊆ F. The Naive Bayesian classifier combines the probabilities of every feature in F1 and estimates the probability of the message being spam. In terms of a spam classifier, Bayes theorem (Equation (4.1)) can be expressed as

P(C|F) = P(F|C) P(C) / P(F) (4.2)

where F = {f_1, f_2, ..., f_n} is a set of features and C = {good, spam} are the two classes. When the number of features, n, is large, computing P(F|C) can be time consuming. With reference to spam filtering, C refers to Ham and Spam.

Figure 4.3 Model for naive Bayesian spam filtering

It is also assumed that the features, which are usually words appearing in the message, are independent of each other; this assumption fails in some conditions, e.g. the word "Viagra" is likely to co-occur with "purchase". Using the naive independence assumption of the NB classifier, the joint probability for all n features can be obtained as a product of the individual probabilities:

P(F|C) = ∏_{i=1}^{n} P(f_i|C) (4.3)

Inserting (4.3) into (4.2) yields

P(C|F) = P(C) ∏_{i=1}^{n} P(f_i|C) / P(F) (4.4)

The denominator P(F) is the probability of observing the features in any message and can be expressed as

P(F) = P(spam) ∏_{i=1}^{n} P(f_i|spam) + P(good) ∏_{i=1}^{n} P(f_i|good) (4.5)

Inserting (4.5) in (4.4), the formula used by the Naive Bayesian classifier is obtained:

P(C|F) = P(C) ∏_{i=1}^{n} P(f_i|C) / [P(spam) ∏_{i=1}^{n} P(f_i|spam) + P(good) ∏_{i=1}^{n} P(f_i|good)] (4.6)

If C = spam, then (4.6) can be interpreted as: the probability of a message being spam, given its features, equals the probability of any message being spam multiplied by the probability of the features co-occurring in a spam e-mail, divided by the probability of observing the features in any message. Bayesian spam filtering, which is well suited for spam detection, also plays a vital role in identifying phishing mails. Naïve Bayesian is a text classifier algorithm that analyzes textual features of an e-mail to identify it as ham, spam or phish based on probabilistic scoring of its textual attributes, as depicted in Figure 4.4.

Figure 4.4 Naïve Bayes classifier for spam/phishing e-mails
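A minimal numerical sketch of Equation (4.6) follows; the tiny two-class corpus is invented for illustration, and Laplace smoothing is added (an assumption not stated in the text) to avoid zero probabilities for unseen words:

```python
from collections import Counter

# Hand-made corpus of tokenized messages; counts are hypothetical.
spam_docs = [["offer", "viagra", "free"], ["free", "offer", "money"]]
ham_docs  = [["meeting", "schedule", "report"], ["report", "free", "lunch"]]

def train(docs):
    counts = Counter(w for d in docs for w in d)
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam_docs)
ham_counts, ham_total = train(ham_docs)
p_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))
p_ham = 1 - p_spam
VOCAB = len(set(spam_counts) | set(ham_counts))

def likelihood(features, counts, total):
    """Product of per-feature probabilities, Laplace-smoothed: Eq. (4.3)."""
    p = 1.0
    for f in features:
        p *= (counts[f] + 1) / (total + VOCAB)
    return p

def p_spam_given(features):
    """Equation (4.6): P(spam|F) = P(spam) P(F|spam) / P(F)."""
    num = p_spam * likelihood(features, spam_counts, spam_total)
    den = num + p_ham * likelihood(features, ham_counts, ham_total)
    return num / den

print(round(p_spam_given(["free", "offer"]), 3))  # → 0.818
```

With "free" and "offer" both frequent in the spam half of the corpus, the posterior comes out well above 0.5, so a thresholded classifier would flag the message as spam.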

An incoming e-mail is first tokenized to get individual tokens. The corresponding probabilities for each token are retrieved. The knowledge base is trained with a large set of input mails (both spam and ham). This training process involves the calculation of n-grams and their respective values in the Spam/Phishing and Ham databases (i.e. the number of occurrences of n-grams in the spam/ham database) for each keyword. The spam/ham probability is calculated using the Bayesian formula as shown in the Bayesian Algorithm.

Fuzzy Logic Without Semantic Analyzer (FWSA)

FWSA is based on a fuzzy similarity approach which can automatically classify messages as spam or legitimate by taking the content of the message into consideration and adapting its decision accordingly. The system can adapt to spammer tactics and dynamically build its knowledge base. Fuzzy logic deals with fuzzy sets that allow partial membership in a set. Using a fuzzy similarity approach (El-Sayed et al 2008), a classification model is built from a set of pre-classified instances.

System Architecture

E-mail spam filtering takes place through three phases: pre-processing, training set generation and classification. Figure 4.5 shows how spam mail is filtered based on the content of the message text. Tokens from all the messages are combined into one vector. The number of occurrences of each token in each category is calculated and the fuzzy token-category relation is defined. The token with the maximum number of occurrences will be assigned a value of 1, and all other tokens will be assigned proportional values. The membership degree of a token in each category is calculated using fuzzy token frequency. Each message is filtered based on calculating a fuzzy


similarity measure between the received message and each category. During the training phase, a model is built based on the characteristics of each category in a pre-classified set of messages. The training dataset should be selected in such a way that it varies in content and subject. Each sample message is labeled with a specific category.

Figure 4.5 Fuzzy logic based spam filtering architecture

The decision is made based on the threshold value and the ratio of the similarity measures. However, if a false positive is more serious than a false negative, a threshold value λ should be maintained, where λ > 1, to take the decision based on the ratio of the similarity measures. The choice of λ depends on the user's personal preference for the trade-off between false positives and false negatives.

Pre-processing

Before messages in the given corpus are used for training and classification, some preprocessing needs to be done in order to reach optimum results. First, all HTML tags are stripped off; then, all stop words, i.e. words that appear frequently but have low content-discriminating power, such as "a", "an", "and", "the" and "for", are removed from each message. The message is then tokenized into a set of strings separated by some delimiters, e.g. whitespace. These tokens (or terms) can represent words, phrases or any keyword patterns. All mixed-case tokens are converted to lowercase. The resulting set of tokens is stemmed to their roots to avoid treating different forms of the same word as different attributes, thus reducing the size of the attribute set. If a token appears only a few times in either category, it is removed. Now, tokens from all messages are combined into one vector T = <t_1, t_2, ..., t_N>, where N is the total number of tokens. The number of occurrences of each token, t_i, in each category, c ∈ {spam, legitimate}, is also determined.

Training Set Generation

During the training phase, as illustrated in Figure 4.6, a model is built based on the characteristics of each category in a pre-classified set of e-mail messages. The training dataset should be selected in such a way that it varies in content and subject. Each sample message is labeled with a specific category. Firstly, perform the pre-processing to extract tokens and determine the number of occurrences of each token in each category. Let C = {spam, legitimate} represent the category set, T denote the set of tokens, and f_{i,c} denote the frequency of token t_i in category c. From these data, we define a fuzzy token-category relation which maps each element in T × C to a membership value between 0 and 1, i.e. R: T × C → [0, 1]. Calculate the membership degree of token t_i in category c_j.

Figure 4.6 Block diagram for training set generation

µ_R(t_i, c_j) = f_{i,c_j} / (f_{i,legitimate} + f_{i,spam}) (4.7)

where,
µ_R(t_i, c_j) - membership degree of token t_i in category c_j
f_{i,c_j} - frequency of occurrences of token t_i in the specific category c_j
f_{i,legitimate} + f_{i,spam} - frequency of occurrences of token t_i over all categories

Classification

Spam filtering is based on calculating a fuzzy similarity measure between the received message d and each category. In order to calculate fuzzy similarity, first determine the membership degree of each token in the message d, starting from the frequency of each token in the message:

µ_d(t_i) = f_i(d) / max_j f_j(d) (4.8)

The fuzzy similarity measure is calculated using a fuzzy conjunction operator ⊗ and a fuzzy disjunction operator ⊕. The decision is made based on the threshold value and the ratio of the similarity measures.

SM(d, c) = Σ_i [µ_d(t_i) ⊗ µ_R(t_i, c)] / Σ_i [µ_d(t_i) ⊕ µ_R(t_i, c)] (4.9)
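Equation (4.7) can be illustrated with a couple of hypothetical token frequencies (the numbers are invented for illustration):

```python
# token -> (frequency in spam, frequency in legitimate); hypothetical counts.
freq = {
    "offer": (8, 2),
    "meeting": (1, 9),
}

def membership(token, category):
    """Eq. (4.7): frequency in the category over frequency in all categories."""
    f_spam, f_legit = freq[token]
    total = f_spam + f_legit
    return (f_spam if category == "spam" else f_legit) / total

print(membership("offer", "spam"))    # → 0.8
print(membership("meeting", "spam"))  # → 0.1
```

The two membership degrees for a token always sum to 1, reflecting that the relation distributes the token's total frequency across the two categories.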

Set a threshold value λ, where λ > 1.

If SM(d, spam) / SM(d, legitimate) > λ, then d is spam

where,
⊗ is the fuzzy conjunction operator (t-norm),
⊕ is the fuzzy disjunction operator (t-conorm), and
SM is the fuzzy similarity measure between the received message d and each category, i.e. spam or legitimate.

There are various methods to compute fuzzy conjunction and disjunction, as shown in Table 4.1.

Table 4.1 Different fuzzy conjunction (t-norm) and disjunction (t-conorm) operators

T-Norm(x, y):
  Minimum = min{x, y}
  Algebraic product = xy
  Hamacher product = xy / (x + y - xy)
  Einstein product = xy / (2 - (x + y - xy))
  Drastic product = min{x, y} if max(x, y) = 1, 0 otherwise
  Bounded difference = max{0, x + y - 1}

T-Conorm(x, y):
  Maximum = max{x, y}
  Algebraic sum = x + y - xy
  Hamacher sum = (x + y - 2xy) / (1 - xy)
  Einstein sum = (x + y) / (1 + xy)
  Drastic sum = max{x, y} if min(x, y) = 0, 1 otherwise
  Bounded sum = min{1, x + y}
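Using the minimum t-norm and maximum t-conorm from Table 4.1, the similarity measure of Equation (4.9) and the ratio-based decision rule with a threshold greater than 1 can be sketched as follows. The membership degrees and the threshold value are invented for illustration:

```python
def similarity(msg_deg, cat_deg, t_norm=min, t_conorm=max):
    """Eq. (4.9): sum of fuzzy conjunctions over sum of fuzzy disjunctions."""
    num = sum(t_norm(msg_deg[t], cat_deg.get(t, 0.0)) for t in msg_deg)
    den = sum(t_conorm(msg_deg[t], cat_deg.get(t, 0.0)) for t in msg_deg)
    return num / den if den else 0.0

# Hypothetical membership degrees for a message d and the two categories.
d     = {"offer": 1.0, "free": 0.5}
spam  = {"offer": 0.8, "free": 0.9}
legit = {"offer": 0.2, "free": 0.1}

threshold = 1.5  # chosen > 1, per the user's false-positive preference
is_spam = similarity(d, spam) / similarity(d, legit) > threshold
print(is_spam)  # → True
```

Swapping in another operator pair from Table 4.1 only requires passing, e.g., `t_norm=lambda x, y: x * y` (algebraic product); the decision rule itself is unchanged.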

Fuzzy Logic with Semantic Analysis (FSA)

The existing system which uses a Bayesian classifier depends on a predefined threshold and may suffer from a high false positive rate. It may also have a high false negative rate if spammers replace spam keywords with alternate meanings that do not exist in the spam filter database; such mails may escape a filter that uses the Bayesian classifier. In the proposed system, as in Figure 4.7, the alternate words are identified: each word is sent to the lexical database for its alternate meanings, which are then updated in the database so that the filter can identify those words again in future without missing them.

Figure 4.7 System Architecture of Fuzzy Logic with Semantic Analyzer
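The alternate-meaning lookup can be sketched as below. A tiny in-memory dictionary stands in for the WordNet lexical database (in practice one would query a WordNet interface such as NLTK's), and the words shown are hypothetical examples:

```python
# Tiny stand-in for WordNet: word -> set of synonyms.
# A real implementation would query nltk.corpus.wordnet instead.
SYNONYMS = {
    "buy": {"purchase", "procure"},
    "cheap": {"inexpensive", "bargain"},
}

def expand_keywords(db_keywords):
    """Grow the training database with known synonyms of each spam keyword,
    so that rewritten spam words are still recognised in future."""
    expanded = set(db_keywords)
    for word in db_keywords:
        expanded |= SYNONYMS.get(word, set())
    return expanded

db = expand_keywords({"buy", "cheap"})
print(sorted(db))
# → ['bargain', 'buy', 'cheap', 'inexpensive', 'procure', 'purchase']
```

After expansion, a spammer substituting "purchase" for "buy" no longer evades the keyword match, which is the escape route the FSA filter closes.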

To identify words by their semantic meaning, a lexical database is needed to retrieve alternate meanings. Thus WordNet acts as the lexical database, providing synonyms and hypernyms when the keywords are passed to it. All classifiers require an updated database with efficient training for better performance. Hence the database is trained with a large set of input mails (both spam and ham). The training process involves calculating the number of occurrences of words in spam/ham mails and then calculating the probability of occurrence in spam/ham for each keyword from the first phase. In the next step, the words that have been changed to their alternate meanings by spammers are identified with the help of WordNet, from which the synonyms are extracted; the training database is updated with these alternative meanings and the spam mails are filtered as mentioned in FWSA.

User Preference Ontology Filter

Ontologies have played a critical role in the arena of the semantic web and provide a formal basis for the definition of knowledge representation, subsequently enabling the easy exchange of knowledge. Ontologies are used to describe specialized domains and an associated set of vocabularies. Semantics relates to the ability to portray and understand the meaning of information in an expressive way. Their combination in the context of spam filtering facilitates the definition and understanding of spam in a better and more formal way. Youn and McLeod (2009) argue that the application of ontologies to formalize spam offers an improved method for the filtering of spam, in the context of reflecting user preferences more appropriately. Yin et al (2007) presented a different approach by emphasizing the advantages of a multipronged approach, comprising globally trained datasets for generalization and personalized equivalents for specialization.
(Footnote: WordNet is a large lexical database of English with nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms, each expressing a distinct concept.)

Jongwan Kim et al (2007) have stated that ontological knowledge is built by identifying and formalizing the relationship between user choices and

how spam is reacted to. None of the above methods addresses the identity theft issue of spoofed e-mails. Whatever the spam filtering technique, content based or path based, spammers' techniques are adaptive, and the most prominent vulnerability is that white listed or legitimate senders' identities are misused to send spam e-mails. This makes novice end users trust the malicious content/URL embedded in the e-mail and become prone to various attacks. Spoofed mails can be identified to some extent by analyzing the header, and may be placed under a suspicious category for further analysis. However, header analysis also means that if a suspicious e-mail is forwarded again to the peer group through a legitimate identity, it does not raise a suspicious flag and is deemed a legitimate e-mail. Hence, it becomes the burden of the end user to trust or distrust incoming mails from legitimate identities. The suspicious mail, called grey mail, does not exhibit sufficient traits for establishing a degree of confidence that it is either spam or ham, which makes the study of ontology-based approaches a motivating consideration. In fuzzy logic, since each and every word is taken to the lexical database for its alternate meanings, high execution time is required. Hence, in order to avoid this, a novel filtering method is proposed based on ontology, which finds frequently occurring words as keywords for the further filtering process. This ontology-based approach works on user preference to keep track of white listed senders, which increases its accuracy. Figure 4.8 shows the sequence diagram of this phase. The stemmed words from the stemming algorithm are redirected to the Ontofilter, where the frequency of words is calculated. The most frequent words are then taken to WordNet for their alternate meanings. The alternate words are compared and matched against the created ontology and then the probability is calculated.
Mails are filtered and sent to the user's account based on this value. The extracted words are passed through WordNet, an online resource

where the nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each of which expresses a distinct concept.

Figure 4.8 Sequence diagram for User Preference Ontology filter

In this process, three sample ontologies, on earth, newspaper and book, were created. These ontologies are implemented using tree structures. The structures are built based on class and subclass relationships among the words related to the domain of the ontology. The keywords extracted by the text extraction algorithm are taken as input for ontology filtering. A minimum support value is fixed, and keywords whose occurrence exceeds this value are sent to WordNet for the semantic analysis process. In this process, the alternate meaning for each keyword is found and compared against the created ontology. Finally, probabilities are calculated based on the results. Since the ontology technique works on user preference, the values calculated above are checked against the user's choice. If it varies, those mails are filtered out at the mail server.

In an organization, every user is placed under a specific domain by defining a specific ontology and is permitted to send e-mails only under that particular category. For instance, User 1 and User 3 are placed under the domain called EARTH, for which an ontology with a set of keywords is built and specified. Mails from those users (USER1 and USER3) are considered legitimate if and only if they contain keywords from the ontology EARTH, as illustrated in Figure 4.9.

Figure 4.9 EARTH domain ontology

Similarly, other ontologies such as BOOK can be specified for User 2 and User 4, and NEWSPAPER for User 5 and User 6. The incoming mails are preprocessed to extract the features and obtain all the keywords. The spam classification algorithm is implemented using the FP-tree algorithm, an improvement on the Apriori algorithm and a form of associative classifier.
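A simplified sketch of this user-preference check follows. The EARTH ontology tree is flattened into a keyword set, and the user-to-domain mapping and keywords are hypothetical stand-ins for the ones the system would define:

```python
# Flattened stand-in for the EARTH ontology tree; keywords hypothetical.
ONTOLOGY = {
    "EARTH": {"fossil fuels", "natural resources", "earth", "wind", "woods"},
}
# Hypothetical mapping of senders to their assigned domain.
USER_DOMAIN = {"user1": "EARTH", "user3": "EARTH"}

def is_legitimate(sender, keywords):
    """A mail from a domain-bound sender is legitimate only if it carries
    at least one keyword from the sender's assigned ontology."""
    domain = USER_DOMAIN.get(sender)
    if domain is None:
        return False  # unknown sender: defer to the other filters
    return bool(ONTOLOGY[domain] & set(keywords))

print(is_legitimate("user1", ["wind", "energy"]))    # → True
print(is_legitimate("user3", ["lottery", "prize"]))  # → False
```

A spoofed mail reusing USER3's identity but carrying off-domain keywords ("lottery", "prize") fails the check, which is how the filter catches misuse of white listed identities.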

Associative classifiers

Associative classification is a combination of two data mining problems, association and classification, that uses association rules to predict a class label. It is based on the principle that several documents containing the same frequent term association belong to the same class. It has attained popularity for the following reasons: the classifier can handle feature spaces of tens of thousands of dimensions, while decision tree classifiers are limited to several hundred attributes only; and the classifier is able to consider terms together as a group, which relaxes the independence assumption of methods like NB. Associative classifiers have been reported to be more accurate than decision tree techniques (Xindong Wu et al 2008). They have also become more efficient with the invention of new methods for frequent pattern mining (Jiawei Han et al 2004), an improvement on the Apriori algorithm and an underlying operation for classification rule generation. Apriori is a seminal algorithm for finding frequent itemsets using candidate generation (Agrawal and Srikanth 1995). It is characterized as a level-wise complete search algorithm using the anti-monotonicity of itemsets: if an itemset is not frequent, none of its supersets is ever frequent. Apriori achieves good performance by reducing the size of candidate sets, but in cases with many frequent itemsets, large itemsets, or very low minimum support, it still suffers from the cost of generating a huge number of candidate sets and scanning the database repeatedly to check a large set of candidate itemsets (Bayardo 1998). To overcome these disadvantages, a compact data structure, the frequent-pattern tree (FP-tree), is constructed, which is an extended prefix-tree structure for storing crucial, quantitative information about frequent patterns. Efficiency of mining is achieved with three techniques, the first being that a large

database is compressed into a condensed, smaller data structure, the FP-tree, which avoids costly, repeated database scans.

Frequent-pattern tree: design and construction

Let I = {a_1, a_2, ..., a_m} be a set of items, and DB = <T_1, T_2, ..., T_n> a transaction database, where T_i (i ∈ [1..n]) is a transaction which contains a set of items in I. The support (or occurrence frequency) of a pattern A, where A is a set of items, is the number of transactions containing A in DB. A pattern A is frequent if A's support is not less than a predefined minimum support threshold ξ. Given a transaction database DB and a minimum support threshold ξ, the problem of finding the complete set of frequent patterns is called the frequent pattern mining problem. To design a compact data structure for efficient frequent-pattern mining, consider the transaction database DB in Table 4.2. The first column gives the id of the e-mail from the particular sender, the second column contains the keywords in the e-mail, and the third column gives the frequent keywords present in the e-mail, with the minimum support threshold set to 3 (i.e., ξ = 3). A compact data structure can be designed based on the following observations:

1. Since only the frequent items play a role in frequent-pattern mining, it is necessary to perform one scan of the transaction database DB to identify the set of frequent items.

2. If the set of frequent items of each transaction can be stored in some compact structure, it may be possible to avoid repeatedly scanning the original transaction database.

3. If multiple transactions share a set of frequent items, it may be possible to merge the shared sets, with the number of occurrences registered as a count. It is easy to check whether two sets are identical if the frequent items in all of the transactions are listed according to a fixed order.

Table 4.2 Transaction database

TID | Keywords present in the e-mail | (Ordered) frequent keywords
1 | Fossil fuels, Earth, Natural resources, Land, Biosphere, Energy, Wind, Woods | Fossil fuels, Natural resources, Earth, Wind, Woods
2 | Earth, Natural and environmental hazards, Natural resources, Fossil fuels, Cloud, Wind, Water | Fossil fuels, Natural resources, Earth, Natural and environmental hazards, Wind
3 | Natural and environmental hazards, Fossil fuels, Food, Biosphere, Water | Fossil fuels, Natural and environmental hazards
4 | Natural and environmental hazards, Natural resources, Rain, Snow, Woods | Natural resources, Natural and environmental hazards, Woods
5 | Earth, Fossil fuels, Natural resources, Weather, Cloud, Woods, Wind, Fuel oil | Fossil fuels, Natural resources, Earth, Wind, Woods

First, a scan of DB derives a list of frequent items, <(Fossil fuels:4), (Natural resources:4), (Earth:3), (Natural and environmental hazards:3), (Wind:3), (Woods:3)> (the number after ":" indicates the support), in which items are ordered in frequency descending order. This ordering is important since each path of a tree will follow this order. Second, the root of a tree is created and labeled with "null". The FP-tree is constructed as follows by scanning the transaction database DB a second time.

1. The scan of the first transaction leads to the construction of the first branch of the tree: <(Fossil fuels:1), (Natural resources:1), (Earth:1), (Wind:1), (Woods:1)>. Notice that the frequent items in the transaction are listed according to the order in the list of frequent items.

2. For the second transaction, since its (ordered) frequent item list <Fossil fuels, Natural resources, Earth, Natural and environmental hazards, Wind> shares a common prefix <Fossil fuels, Natural resources, Earth> with the existing path <Fossil

24 96 fuels, Natural resources, Earth, Natural and environment hazards, Wind, Woods> the count of each node along the prefix is incremented by 1, and one new node (Natural and environment hazards:1) is created and linked as a child of (Earth:2) and another new node (Wind:1) is created and linked as the child of (Natural and environment hazards:1). 3. For the third transaction, since its frequent item list <Fossil fuels, Natural and environment hazards> shares only the node <Fossil fuels> with the f -pre x subtree, f s count is incremented by 1, and a new node (natural and environment hazards :1) is created and linked as a child of ( fossil fuels :3). 4. The scan of the fourth transaction leads to the construction of the second branch of the tree, <(Natural resources:1), (Natural and environment hazards :1), (woods:1)>. 5. For the last transaction, since its frequent item list <Fossil fuels, Natural resources, Earth, Wind, Woods> is identical to the first one, the path is shared with the count of each node along the path incremented by 1. In order to facilitate tree traversal, an item header table is built in which each item points to its first occurrence in the tree through a node-link. Nodes with the same item-name are linked in sequence through such node-links. The tree, together with the associated node links after scanning all the transactions are shown in Figure 4.10.
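The five insertion steps above can be sketched as follows, using the single-letter abbreviations of Figure 4.10; the header table and node-links are omitted for brevity.

```python
class FPNode:
    """Node of an FP-tree: item name, count, and children keyed by item."""
    def __init__(self, item=None):
        self.item = item      # None for the root, which is labelled null
        self.count = 0
        self.children = {}

def build_fptree(ordered_transactions):
    """Second database scan: insert each (ordered) frequent-item list,
    sharing prefixes and incrementing counts along shared paths."""
    root = FPNode()
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item)
                node.children[item] = child
            child.count += 1
            node = child
    return root

# (Ordered) frequent keyword lists from Table 4.2, abbreviated as in
# Figure 4.10: f = Fossil fuels, c = Natural resources, a = Earth,
# b = Natural and environmental hazards, m = Wind, p = Woods.
ordered = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
tree = build_fptree(ordered)
```

Inspecting the result reproduces the construction described above: the first branch f-c-a-m-p ends with counts (f:4), (c:3), (a:3), the shared path of transactions 1 and 5 gives (m:2), (p:2), and the branches (b:1) under f, under a and under the second root child c each carry count 1.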

Figure 4.10 FP-tree construction (f = Fossil fuels, c = Natural resources, a = Earth, b = Natural and environmental hazards, m = Wind, p = Woods)

The FP-tree construction takes exactly two scans of the transaction database: the first scan collects the set of frequent items, and the second scan constructs the FP-tree. The cost of inserting a transaction T into the FP-tree is O(|f(T)|), where f(T) is the set of frequent items in T. The FP-tree contains the complete information for frequent-pattern mining: according to the Apriori principle, the set of frequent-item projections of the transactions in the database is sufficient for mining the complete set of frequent patterns, because an infrequent item plays no role in any frequent pattern. Patterns can be generated for a specific keyword; for instance, the path containing the keyword woods yields the result <fossil fuels, natural resources, earth, wind, woods>.

The probability P(word/category) used to place a particular mail under a specific category depends on the number of words extracted from the e-mail of a particular user relative to the total number of words in that category:

P(word/category) = (total no. of words extracted from the e-mail of a particular user) / (total no. of words in that category) (4.11)

If the probability of existence P(word/category) > Threshold (say 50%), accept the e-mail under the specific category mentioned against the

particular user; this reveals that the received mail is indeed from the legitimate whitelisted sender. If the probability of existence P(word/category) < Threshold (say 50%), place the e-mail under the suspicious category; this reveals that the received mail, though it claims to be from a legitimate whitelisted sender, is a spoofed mail involving identity theft.

4.5 LEGITIMATE ATTACHMENTS

Some spammers, to escape content based filters, send e-mails without any content in the mail body, attaching documents or including links to malicious URLs rich in spam content. By attaching legitimate file formats such as the MS Office formats .doc and .ppt, as well as .html and PDF, spammers avoid content analysis by spam filters, and a message carrying commercial advertisements with many spam keywords is passed. Such a message also draws the user's attention to open the attachment, while it usually has no spam subject or content, in order to pass the spam filter.

In order to extract the contents from Microsoft Office documents, Apache POI is used; it creates and maintains Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). All of the POI text extractors extend org.apache.poi.POITextExtractor, which provides a common method across all extractors, getText().

Document Extraction

All the POI text extractors extend org.apache.poi.POITextExtractor, which provides the common getText() method. All POIFS/OLE2-based text extractors also extend org.apache.poi.POIOLE2TextExtractor, which additionally provides common methods to get at the HPSF document metadata. All OOXML-based text extractors (available in POI 3.5 and later) also extend org.apache.poi.POIOOXMLTextExtractor, which additionally provides common methods to get the OOXML metadata. org.apache.poi.hwpf.extractor.WordExtractor is used to return the text of .doc files in the Word 97 binary format.

Text Extraction

A new class in POI 3.5, org.apache.poi.extractor.ExtractorFactory, provides a similar function to WorkbookFactory. It can simply be passed an InputStream, a file, a POIFSFileSystem or an OOXML package; it figures out the correct text extractor and returns it.

PowerPoint Extraction

org.apache.poi.hslf.extractor.PowerPointExtractor is used for basic text extraction from PowerPoint files. It accepts a file or an input stream. The getText() method can be used to get the text from the slides, and the getNotes() method to get the text from the notes; getText(true, true) will get the text from both.

PDF Extraction

Fang Yuan and Bo Liu (2005) proposed a method for extracting information from PDF files by parsing them to obtain text and format information and injecting tags into the text to transform it into semi-structured text. The extraction from PDF files is done with a library called PDFBox. Apache PDFBox is an open-source Java PDF library that allows the creation of new PDF documents, the manipulation of existing documents and the extraction of their content.

After extracting the contents from the attachments, techniques such as Bayesian filtering, fuzzy logic without semantic analyzer, fuzzy logic

with semantic analyzer and the user preference based ontology filter can be applied to filter spam e-mails and obtain legitimate mails.

4.6 PERFORMANCE ANALYSIS

The e-mail server is implemented with hMailServer and incorporates header based filtering, content based filtering and URL filtering. For the content based filtering, a sample of 550 e-mails is fed into the classifier and results are obtained in six trials. Initially, Trial 1 contains 20 legitimate and 80 spam mails; more spam mails are fed in order to obtain more spam keywords, which are useful for training. Trials 2, 3, 4 and 5 are then fed with equal numbers of spam and legitimate mails: 50 spam mails and 50 legitimate mails each. Trial 6 contains 20 legitimate and 30 spam mails. Cumulatively, the spam mails outnumber the legitimate mails: 310 spam mails against 240 legitimate mails. The same set of inputs is used for the Bayesian classifier, fuzzy logic without semantic analyzer, fuzzy logic with semantic analyzer and the user preference based ontology filter.

Bayesian Classifier

The Bayesian classification phase involves parsing words as n-gram letters; their occurrences in the mail are found by comparison with the trained knowledge base, and the spam/ham probability is calculated using the Bayesian formula. The evaluation of all the above methods is based on the following measures, where L→L denotes legitimate mails classified as legitimate, L→S legitimate mails classified as spam, S→L spam mails classified as legitimate and S→S spam mails classified as spam:

Accuracy Rate = (L→L + S→S) / (L→L + L→S + S→L + S→S) (4.12)

Error Rate = (S→L + L→S) / (L→L + L→S + S→L + S→S) (4.13)

Recall = S→S / (S→S + S→L) (4.14)

Precision = S→S / (S→S + L→S) (4.15)

where Recall (efficiency) is the percentage of spam messages the filter can block, and Precision is the degree to which the blocked mails are indeed spam. The values in Tables 4.3 and 4.4 are illustrated by Figure 4.11.

Table 4.3 Test runs for Bayesian classifier (columns: Trial No; No. of e-mails; Expected Results: Legitimate Mails, Spam Mails; Obtained results: L→S, S→L)

Table 4.4 Performance measures for Bayesian classifier (Features: Accuracy, FPR, FNR, Recall, Precision, reported for Trials 1 to 6)
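Equations (4.12) to (4.15) can be sketched directly from the four outcome counts; the counts in the example below are hypothetical and are not taken from the trial tables.

```python
def evaluate(LL, LS, SL, SS):
    """Compute the measures of Equations (4.12)-(4.15) from the four
    outcome counts: LL = legitimate kept, LS = legitimate marked spam
    (false positives), SL = spam let through (false negatives),
    SS = spam blocked."""
    total = LL + LS + SL + SS
    return {
        "accuracy": (LL + SS) / total,   # Eq. (4.12)
        "error": (SL + LS) / total,      # Eq. (4.13)
        "recall": SS / (SS + SL),        # Eq. (4.14): share of spam blocked
        "precision": SS / (SS + LS),     # Eq. (4.15): blocked mail that is spam
    }

# Hypothetical trial of 50 legitimate and 50 spam mails.
m = evaluate(LL=45, LS=5, SL=10, SS=40)
```

By construction the accuracy and error rates are complementary, which is a quick sanity check on any filled-in trial table.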

Figure 4.11 Performance measure of Bayesian classifier

Referring to Table 4.4, the accuracy rate varies from 0.78 to 0.8, which indicates that a larger number of inputs yields lower accuracy, since more training is needed. The false positive rate varies from 0.3 to 0.4, fluctuating with the inputs and the knowledge base. The Bayesian classifier's performance can be varied by adjusting the threshold value from 0.5 to 0.9 so as to admit more legitimate mails into the user's inbox. The false negative rate varies from 0.09 to 0.19 due to the large number of spam mails. Recall varies upward from 0.85. Precision fluctuates from 0.77 to 0.9, which indicates that as the data grows more training is required, since the increase in input makes the knowledge base inefficient.

Fuzzy Logic without Semantic Analyzer

The FWSA method identifies spam e-mails based on the fuzzy similarity measure. The results obtained are given in Tables 4.5 and 4.6, and the performance is shown in Figure 4.12.
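The thesis's exact fuzzy similarity model is not reproduced here; the following is a minimal sketch of one standard fuzzy (min/max) set similarity between a message's terms and the spam keyword base, with invented membership weights and an invented decision threshold.

```python
def fuzzy_similarity(msg_weights, spam_weights):
    """Fuzzy set similarity: sum of min memberships over sum of max
    memberships across the union of terms (a fuzzy Jaccard measure)."""
    terms = set(msg_weights) | set(spam_weights)
    num = sum(min(msg_weights.get(t, 0.0), spam_weights.get(t, 0.0)) for t in terms)
    den = sum(max(msg_weights.get(t, 0.0), spam_weights.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0

# Hypothetical membership weights of spam keywords in the knowledge base.
spam_kb = {"offer": 0.9, "free": 0.8, "winner": 0.7, "meeting": 0.1}
msg = {"offer": 1.0, "free": 1.0, "agenda": 1.0}

score = fuzzy_similarity(msg, spam_kb)
# Classify as spam when the similarity exceeds a tuned threshold.
is_spam = score > 0.4
```

Because membership degrees rather than binary term presence drive the score, partially spammy vocabulary contributes proportionally, which is what lets the threshold be tuned per deployment.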

Table 4.5 Test runs for FWSA (columns: Trial No; No. of e-mails; Expected Results: Legitimate Mails, Spam Mails; Obtained results: L→S, S→L)

Table 4.6 Performance measures for FWSA (Features: Accuracy, FPR, FNR, Recall, Precision, reported for Trials 1 to 6)

Figure 4.12 Performance measure for fuzzy logic without semantic analyzer

The accuracy rate varies from 0.8 to 0.9, which indicates that a larger number of inputs yields lower accuracy, since more training is needed. The false positive rate varies from 0.1 to 0.19 and fluctuates depending on the inputs and the threshold. The false negative rate varies from 0.1 to 0.21 due to the large number of spam mails. Recall varies from 0.79 to 0.9. Precision fluctuates from 0.84 to 0.97, which indicates that as the data size increases a larger training set is required, since the increase in input makes the knowledge base inefficient.

Fuzzy Logic with Semantic Analyzer

The fuzzy logic with semantic analyzer method identifies spam e-mails based on the fuzzy similarity measure computed against WordNet. The results obtained are tabulated in Tables 4.7 and 4.8 and represented in Figure 4.13.

Table 4.7 Test runs for fuzzy logic with semantic analyzer (columns: Trial No; No. of e-mails; Expected Results: Legitimate Mails, Spam Mails; Obtained results: L→S, S→L; chi-square test for legitimate mails: O−E, (O−E)²/E, with a TOTAL row)

Table 4.8 Performance measures for fuzzy logic with semantic analyzer (Features: Accuracy, FPR, FNR, Recall, Precision, reported for Trials 1 to 6)

Figure 4.13 Performance measure for fuzzy logic with semantic analyzer

Referring to Table 4.8, the accuracy rate varies from 0.91 to 1, which indicates that a larger number of inputs yields lower accuracy, since more training is needed. The false positive rate varies from 0 to 0.07 and fluctuates depending on the inputs and the threshold. The false negative rate varies from 0 to 0.1 due to the large number of spam mails. Recall varies from 0.9 to 1. Precision fluctuates from 0.94 to 1, which indicates that as the data size increases more training is required, since the increase in input makes the knowledge base inefficient.
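In the semantic analyzer, WordNet lookups would normally come from a library such as JWNL or NLTK; as a minimal language-neutral sketch, a hypothetical synonym table stands in for WordNet, mapping substituted words back to canonical spam keywords before the fuzzy similarity is computed.

```python
# Hypothetical synonym/canonical-form table standing in for WordNet.
SYNONYMS = {
    "complimentary": "free",
    "gratis": "free",
    "bargain": "offer",
    "proposal": "offer",
}

def normalize(tokens):
    """Map each token to its canonical form so that spammer word
    substitutions still match the spam keyword base."""
    return [SYNONYMS.get(t.lower(), t.lower()) for t in tokens]

print(normalize(["Gratis", "proposal", "meeting"]))
```

Normalizing before scoring is what lets the semantic variant catch messages where spammers have replaced telltale keywords with synonyms.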

User Preference Ontology Filter

The user preference ontology filter filters the mails that match the preferences specified by the users. The results obtained are tabulated in Tables 4.9 and 4.10 and represented in Figure 4.14.

Table 4.9 Test runs for user preference ontology filter (columns: Trial No; No. of e-mails; Expected Results: Legitimate Mails, Spam Mails; Obtained results: L→S, S→L; chi-square test for legitimate mails: O−E, (O−E)²/E, with a TOTAL row)

Table 4.10 Performance measures for user preference ontology filter (Features: Accuracy, FPR, FNR, Recall, Precision, reported for Trials 1 to 6)

Figure 4.14 Performance measures for user preference ontology filter

The accuracy rate varies from 0.95 to 1, which indicates that a larger number of inputs yields lower accuracy, since more training is needed. The FPR varies from 0 to 0.04 and fluctuates depending on the threshold; hence the user may not be able to view all legitimate mails, as some may be misclassified as spam. The false negative rate for the user preference ontology filter varies from 0 to 0.06 due to the large number of spam mails. Recall varies from 0.94 to 1. Precision fluctuates from 0.97 to 1, which indicates that as the data grows more training is required, since the increase in input makes the knowledge base inefficient.

Chi-Square Test for Goodness of Fit:

χ² = Σ (O − E)² / E (4.16)

where O denotes the observed frequencies and E the expected frequencies.

Here n is the number of observations, and the degrees of freedom = n − 1. In this case, degrees of freedom = total entries − 1, i.e., r = 6 − 1 = 5. For r = 5 at the 0.05 significance level, the tabulated critical value of χ² is 11.07. For fuzzy logic with semantic analyzer and the user preference ontology filter, the computed value of χ² is less than the critical value; hence the null hypothesis is accepted.

4.7 COMPARISONS WITH THE EXISTING SYSTEMS

The performance measures accuracy rate, false positive rate, false negative rate, recall and precision are used to compare the various techniques. The results are tabulated in Table 4.11 and represented in Figure 4.15.
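The goodness-of-fit computation of Equation (4.16) can be sketched as follows; the observed and expected legitimate-mail counts are hypothetical, not the thesis's trial data.

```python
def chi_square(observed, expected):
    """Goodness-of-fit statistic of Equation (4.16):
    sum of (O - E)^2 / E over all trials."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed/expected legitimate-mail counts for six trials.
O = [19, 48, 49, 50, 47, 20]
E = [20, 50, 50, 50, 50, 20]

stat = chi_square(O, E)
CRITICAL = 11.07  # tabulated chi-square value for df = 5 at the 0.05 level
accept_null = stat < CRITICAL
```

With six trials the degrees of freedom are 6 − 1 = 5, so the statistic is compared against the 11.07 critical value; a smaller statistic means the obtained classifications fit the expected ones and the null hypothesis is accepted.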

Table 4.11 Performance comparison of four methods (for each of Trials 1 to 6, the rows Bayesian, FWSA, FSA and Ontology report the Accuracy, FPR, FNR, Recall and Precision obtained in that trial)

B Bayesian Classifier, F Fuzzy Logic without Semantic Analyzer, SA Fuzzy Logic with Semantic Analyzer, O User Preference Ontology filter

Figure 4.15 Performance comparison of four methods

Accuracy Rate: The accuracy rate declines as the number of inputs increases. It also improves progressively from the Bayesian classifier through fuzzy logic without semantic analyzer and fuzzy logic with semantic analyzer to the user preference ontology filter.

False Positive Rate: The false positive rate varies from 0 to 0.4. It is highest for the Bayesian classifier, ranging from 0.3 to 0.4, and much lower for fuzzy logic with semantic analyzer, with a range of 0 to 0.07.

False Negative Rate: The false negative rate varies from 0 to 0.2. It is highest for the Bayesian classifier, ranging from 0.15 to 0.19, and much lower for fuzzy logic with semantic analyzer, with a range of 0 to 0.1.

Recall: Recall ranges from 0.79 to 1 and remains highest for fuzzy logic with semantic analyzer.

Precision: Precision ranges from 0.77 to 1 and remains highest for fuzzy logic with semantic analyzer.

4.8 CONCLUSION

Various existing models such as the Bayesian classifier and fuzzy logic have been implemented for spam filtering. A trainable fuzzy logic based classification system, built from fuzzy rules during the training phase and based on the fuzzy similarity model, has been applied to classify spam and legitimate messages by their content. It can adapt to spammer tactics and dynamically build its knowledge base. Since the machine learning filter suffers from word obfuscation and Bayesian poisoning, fuzzy logic with semantic analyzer and the user preference based ontology filter are also implemented to identify the words that have been replaced by spammers, which prevents spam mails from bypassing the filter. In order to avoid identity theft and to distinguish spoofed mails from legitimate identities in a secure organization, senders are categorized under specific domains. The system also handles attachments with the extensions .txt, .doc, .pdf, .ppt and .html, extracting the content alone and applying the spam filtering technique to detect spam mails.


More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

NLP Final Project Fall 2015, Due Friday, December 18

NLP Final Project Fall 2015, Due Friday, December 18 NLP Final Project Fall 2015, Due Friday, December 18 For the final project, everyone is required to do some sentiment classification and then choose one of the other three types of projects: annotation,

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

ScienceDirect. Enhanced Associative Classification of XML Documents Supported by Semantic Concepts

ScienceDirect. Enhanced Associative Classification of XML Documents Supported by Semantic Concepts Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 194 201 International Conference on Information and Communication Technologies (ICICT 2014) Enhanced Associative

More information

Keywords : Bayesian, classification, tokens, text, probability, keywords. GJCST-C Classification: E.5

Keywords : Bayesian,  classification, tokens, text, probability, keywords. GJCST-C Classification: E.5 Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 13 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

More information

A Bagging Method using Decision Trees in the Role of Base Classifiers

A Bagging Method using Decision Trees in the Role of Base Classifiers A Bagging Method using Decision Trees in the Role of Base Classifiers Kristína Machová 1, František Barčák 2, Peter Bednár 3 1 Department of Cybernetics and Artificial Intelligence, Technical University,

More information

Mining Frequent Patterns without Candidate Generation

Mining Frequent Patterns without Candidate Generation Mining Frequent Patterns without Candidate Generation Outline of the Presentation Outline Frequent Pattern Mining: Problem statement and an example Review of Apriori like Approaches FP Growth: Overview

More information

Chapter 3: Supervised Learning

Chapter 3: Supervised Learning Chapter 3: Supervised Learning Road Map Basic concepts Evaluation of classifiers Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Summary 2 An example

More information

Classification Algorithms in Data Mining

Classification Algorithms in Data Mining August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

This paper proposes: Mining Frequent Patterns without Candidate Generation

This paper proposes: Mining Frequent Patterns without Candidate Generation Mining Frequent Patterns without Candidate Generation a paper by Jiawei Han, Jian Pei and Yiwen Yin School of Computing Science Simon Fraser University Presented by Maria Cutumisu Department of Computing

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Binary Decision Diagrams

Binary Decision Diagrams Logic and roof Hilary 2016 James Worrell Binary Decision Diagrams A propositional formula is determined up to logical equivalence by its truth table. If the formula has n variables then its truth table

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

Mining Association Rules From Time Series Data Using Hybrid Approaches

Mining Association Rules From Time Series Data Using Hybrid Approaches International Journal Of Computational Engineering Research (ijceronline.com) Vol. Issue. ining Association Rules From Time Series Data Using ybrid Approaches ima Suresh 1, Dr. Kumudha Raimond 2 1 PG Scholar,

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Building Bi-lingual Anti-Spam SMS Filter

Building Bi-lingual Anti-Spam SMS Filter Building Bi-lingual Anti-Spam SMS Filter Heba Adel,Dr. Maha A. Bayati Abstract Short Messages Service (SMS) is one of the most popular telecommunication service packages that is used permanently due to

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Memory issues in frequent itemset mining

Memory issues in frequent itemset mining Memory issues in frequent itemset mining Bart Goethals HIIT Basic Research Unit Department of Computer Science P.O. Box 26, Teollisuuskatu 2 FIN-00014 University of Helsinki, Finland bart.goethals@cs.helsinki.fi

More information

International Journal of Computer Engineering and Applications, Volume XI, Issue VIII, August 17, ISSN

International Journal of Computer Engineering and Applications, Volume XI, Issue VIII, August 17,  ISSN International Journal of Computer Engineering and Applications, Volume XI, Issue VIII, August 17, www.ijcea.com ISSN 2321-3469 SPAM E-MAIL DETECTION USING CLASSIFIERS AND ADABOOST TECHNIQUE Nilam Badgujar

More information

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining D.Kavinya 1 Student, Department of CSE, K.S.Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India 1

More information

Information Retrieval. Chap 7. Text Operations

Information Retrieval. Chap 7. Text Operations Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

A Hybrid Algorithm Using Apriori Growth and Fp-Split Tree For Web Usage Mining

A Hybrid Algorithm Using Apriori Growth and Fp-Split Tree For Web Usage Mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. III (Nov Dec. 2015), PP 39-43 www.iosrjournals.org A Hybrid Algorithm Using Apriori Growth

More information

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42 Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth

More information

An Algorithm for Mining Frequent Itemsets from Library Big Data

An Algorithm for Mining Frequent Itemsets from Library Big Data JOURNAL OF SOFTWARE, VOL. 9, NO. 9, SEPTEMBER 2014 2361 An Algorithm for Mining Frequent Itemsets from Library Big Data Xingjian Li lixingjianny@163.com Library, Nanyang Institute of Technology, Nanyang,

More information

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm International Journal of Scientific & Engineering Research Volume 4, Issue3, arch-2013 1 Improving the Efficiency of Web Usage ining Using K-Apriori and FP-Growth Algorithm rs.r.kousalya, s.k.suguna, Dr.V.

More information

Data Mining for Knowledge Management. Association Rules

Data Mining for Knowledge Management. Association Rules 1 Data Mining for Knowledge Management Association Rules Themis Palpanas University of Trento http://disi.unitn.eu/~themis 1 Thanks for slides to: Jiawei Han George Kollios Zhenyu Lu Osmar R. Zaïane Mohammad

More information

Ontology Based Search Engine

Ontology Based Search Engine Ontology Based Search Engine K.Suriya Prakash / P.Saravana kumar Lecturer / HOD / Assistant Professor Hindustan Institute of Engineering Technology Polytechnic College, Padappai, Chennai, TamilNadu, India

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Machine Learning: Perceptron Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer and Dan Klein. 1 Generative vs. Discriminative Generative classifiers:

More information

Chapter 8 The C 4.5*stat algorithm

Chapter 8 The C 4.5*stat algorithm 109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the

More information

Chapter 4: Association analysis:

Chapter 4: Association analysis: Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily

More information

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution Summary Each of the ham/spam classifiers has been tested against random samples from pre- processed enron sets 1 through 6 obtained via: http://www.aueb.gr/users/ion/data/enron- spam/, or the entire set

More information

Applying Packets Meta data for Web Usage Mining

Applying Packets Meta data for Web Usage Mining Applying Packets Meta data for Web Usage Mining Prof Dr Alaa H AL-Hamami Amman Arab University for Graduate Studies, Zip Code: 11953, POB 2234, Amman, Jordan, 2009 Alaa_hamami@yahoocom Dr Mohammad A AL-Hamami

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports R. Uday Kiran P. Krishna Reddy Center for Data Engineering International Institute of Information Technology-Hyderabad Hyderabad,

More information

Discovering Advertisement Links by Using URL Text

Discovering Advertisement Links by Using URL Text 017 3rd International Conference on Computational Systems and Communications (ICCSC 017) Discovering Advertisement Links by Using URL Text Jing-Shan Xu1, a, Peng Chang, b,* and Yong-Zheng Zhang, c 1 School

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet Joerg-Uwe Kietz, Alexander Maedche, Raphael Volz Swisslife Information Systems Research Lab, Zuerich, Switzerland fkietz, volzg@swisslife.ch

More information

FP-Growth algorithm in Data Compression frequent patterns

FP-Growth algorithm in Data Compression frequent patterns FP-Growth algorithm in Data Compression frequent patterns Mr. Nagesh V Lecturer, Dept. of CSE Atria Institute of Technology,AIKBS Hebbal, Bangalore,Karnataka Email : nagesh.v@gmail.com Abstract-The transmission

More information

3 Feature Selection & Feature Extraction

3 Feature Selection & Feature Extraction 3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy

More information