Spam Filtering using Contextual Network Graphs


Abstract

This document describes a machine-learning solution to the spam-filtering problem. Spam filtering is treated as a text-classification problem in a very high-dimensional space. Two text-classification algorithms, Latent Semantic Indexing (LSI) and Contextual Network Graphs (CNG), are compared to existing Bayesian techniques by monitoring their ability to process and correctly classify a series of spam and non-spam documents. The LSI and CNG algorithms have an advantage over the Naïve Bayes classifier in the domain of natural language processing because they include a representation of context, i.e. relations between terms. Both LSI and CNG take advantage of these relations to offer a conceptual or semantic-based search, which is adapted in this paper to the domain of spam filtering.

Contents

1. Introduction
   1.1 Aims
   1.2 Background
       1.2.1 Spam
       1.2.2 Spam Filtering
       1.2.3 Text Classification
       1.2.4 Spam Filtering as a Context-Heavy Text-Classification Problem
2. Technology
   2.1 Naïve Bayes Classifier
   2.2 Latent Semantic Indexing
   2.3 Contextual Network Graphs
   2.4 Comparison of Algorithms
3. Methods
   3.1 Source Material
   3.2 Application Structure
       3.2.1 Application Description
       3.2.2 Application Diagrams
       3.2.3 Class Descriptions
   3.3 Stop-word Removal
   3.4 Word Stemming
   3.5 Text Classification
   3.6 Measuring Success
   3.7 Storing and Retrieving Data
   3.8 Expected Results
4. Results
   4.1 Data
   4.2 Discussion
5. Conclusion
   5.1 Recommendations
6. Bibliography

1. Introduction

1.1 Aims

The aims of this project are to develop and test an application that applies three machine-learning text-classification algorithms, the Naïve-Bayes Classifier, Latent Semantic Indexing [1] and Contextual Network Graphs [2], to the field of spam filtering. A number of pre-classified emails are processed by the algorithms, and the results of each algorithm's classification of the emails are compared in order to determine which algorithm is the most successful and what conditions are required for each algorithm to succeed. Success will be measured primarily by the algorithms' ability to correctly identify spam and non-spam emails, as a function of time, or of the number of emails processed so far. Another measure of success will be the time taken to process a single email (both classification and assimilation of the new information it contains). It is expected that, for some algorithms, the amount of computation required to process an email will increase as the amount of data increases. The amount of memory used by each algorithm for data storage will also be included in the results. Testing the conditions required for the algorithms to succeed will include testing by number of emails (corpus size) and by various other values, such as the parameters used in the algorithms, in order to determine their ideal values. Other factors that will be taken into account are the abilities of the various algorithms and their data structures to be distributed and run in parallel on different systems. The expected result is that LSI and CNG outperform the Naïve-Bayes Classifier in classification accuracy after a certain number of emails have been processed. The primary purpose of the project is to compare the new CNG technology to the patented LSI approach in order to establish whether CNG is a viable alternative to LSI in context-heavy domains such as spam filtering.

[1] Deerwester et al., 1990
[2] Ceglowski et al., 2003

1.2 Background

1.2.1 Spam

The reception of overwhelming amounts of unsolicited and unwanted emails (spam) is a problem experienced by almost every email user worldwide. By 2004 the epidemic had become so widespread that in December 2003 BBC News estimated that 40% of all emails sent are identified as spam, and that identifying and deleting spam costs UK businesses one hour per worker per day. A number of strategies have been tried or proposed to curb this problem. Currently, the proposed strategies can be classified into two different types: prevention techniques and cure techniques. The success of spam is principally due to the fact that each message is sent in huge numbers, i.e. to a very large number of people, in the hope that a tiny proportion of those that receive the message buy the advertised product. The Wall Street Journal estimates that a response rate of 0.0001% is enough for the sender of the spam to turn a profit. Prevention techniques generally aim to stop spam from being sent in the first place, by implementing a number of controls and checks on the global email system that would make it more difficult for spammers to send emails in such large numbers. At the World Economic Forum in January 2004, Bill Gates described his suggestions for spam prevention, including forcing each computer sending an email to perform a simple calculation. This would not affect users that send personal emails only, but would be considerably more expensive for users that send bulk emails to large numbers of recipients. Another suggestion Gates made was "payment at risk", which would force senders to pay a charge each time one of their emails was dismissed as spam. The main problem with prevention techniques is that they require the establishment of a set of controls on the email system that would have to be run by corporations. In the case of the "payment at risk" option, a huge central clearing office would be required to process the payments. These requirements mean that a corporation or set of corporations could theoretically gain control over the worldwide email network. In general, the principle of the freedom of the internet would oppose this solution. Other common attempts to solve the problem of spam can be considered cure techniques. The objective of these techniques is to stop spam messages from entering the inbox of their intended recipients after they have been sent. These can be implemented in the form of server- or client-side

applications that filter out unsolicited emails using a number of rules that are either predefined or generated by a learning algorithm. The most common filter applications use keyword lists and ranges of blocked IP addresses to stop potential spam before it arrives in an inbox. So far, however, none of the proposed solutions has had much success, due to the complexity of the problem, both in terms of the ethics of harnessing and restricting a free system like the internet, and in terms of the logistics of keyword-based filtering. The prevention techniques fall outside the scope of this document; this project is concerned purely with the advancement of filtering ("cure") techniques.

1.2.2 Spam Filtering

Current spam filters come in a number of different forms [3], the most successful being rule-based classifiers such as the original SpamAssassin, and statistical classifiers using Bayesian probability techniques (SpamBayes, later versions of SpamAssassin, etc.). Rule-based classifiers usually contain an extensive number of tests, each one associated with a score, that are carried out on an email. If the email fails a test, its score is increased. After all the tests have been applied, if the email's score is above a certain threshold, it is classified as spam and discarded. Rule-based classifiers have the advantage of being able to employ diverse and specific rules to catch potential spam, such as checking the size of an email or the number of pictures it contains. However, this technique is not machine learning in the general sense, and rules therefore have to be entered and maintained by hand. This is a considerable disadvantage, as learning algorithms such as Bayesian classifiers can derive rules as they receive more information, hence including rules that are too subtle or too complicated to be entered by hand. The Bayesian technique is described in section 2.1.

1.2.3 Text Classification

In mathematical terms, text classification is the partitioning of a set of documents into a number of equivalence classes. Each equivalence class identifies the set of documents that belong to a document type. In the case of spam filtering, for example, there are two document types: spam emails

[3] Mertz, 2002

and non-spam emails. The job of spam filters is to create a partition of a document set containing the emails received in a user's inbox. Currently, the most widely-used text-classification tool is the Naïve-Bayes classifier, which uses probabilistic reasoning to classify documents. It is described in detail in section 2.1.

1.2.4 Spam Filtering as a Context-Heavy Text-Classification Problem

Emails, in general, are documents designed for human communication and therefore use natural language, inheriting all the advantages and disadvantages that it possesses. An advantage of natural language from a text-classification point of view is that the semantic structure of natural language gives rise to the presence of keywords: words that contain a large proportion of the semantic meaning of a sentence. These keywords can then be used for searching for or classifying documents. Keywords can be identified in a number of different ways. In rule-based spam filters, they are given by a list of predefined rules that can be changed by hand. For example, a rule may exist that states that the presence of the word "Vicodin" in an email is an indicator that the message may be spam, or that messages that do not include a reply-to address are likely to be spam. In machine-learning text classifiers, the keywords are derived from the analysis of the document set. Annotated documents (the training set) are analysed, and the results of this analysis can be used in the classification of new documents. For example, if the word "Vicodin" appears more frequently in spam documents in the training set than in non-spam documents, it may be reasonable to assume that "Vicodin" is a keyword that is commonly used in spam messages, and its presence in new documents would be an indicator of a spam email. On the other hand, a word with little semantic import, such as "and", may have been found in relatively equal measure in both document classes and will therefore have little effect on the classification of new documents. Using these techniques, keyword analysis can be a strong tool for text classification. However, natural language, being organic and evolving, is prone to phenomena such as polysemy and synonymy, which weaken the strength of keywords by introducing non-one-to-one relations

between words and meaning. A word can have several meanings, and the same semantic concept can be represented by several different words. Polysemy and synonymy can have detrimental effects on the accuracy of pure keyword-based classifiers. Polysemy can confuse a classifier as it allows the same word to be used in more than one context. For example, the word "play" is highly polysemous and can occur in a number of contexts. This means that "play" may not be a strong indicator of a specific class, as it may be present in different classes in different contexts. Polysemy reduces the strength of keywords by increasing the number of classes in which a polysemous word can appear. If two words are synonymous, i.e. they have identical meanings, it is probable that they will be used interchangeably in the same document class. This would, in turn, often result in a reduction of the frequency of each word. If the two words were replaced by a single word, thus removing the synonymy, this word would have a higher frequency than either of the two words, making it a more powerful keyword. Synonymy reduces the strength of keywords by spreading their value among a number of synonyms. A more powerful approach to the classification of natural language texts is to use context-based searches that take into account the semantic links between words and search over a semantic space rather than simply a list of keywords. This helps to eliminate problems with polysemy and synonymy, as the search or classification is based on semantic data instead of keywords. In terms of text classification, different areas of the semantic space will belong to different classes. When a new document is classified, it is placed into the semantic space, and thus its class can be determined. The subsection of the semantic space that contains documents belonging to a specific class is the class space for that class. The amount of overlap between class spaces in the semantic space is an important factor in the level of success of a context-based classification. Classes with highly specified and compact domains, such as spam emails, are easier to use in text classification than those that have fewer restrictions, such as non-spam emails. Spam emails are usually quite homogeneous in their content; they have a limited subject domain. Non-spam emails can relate to any subject, and are therefore harder to classify, but the domain of a single individual's emails will be a great deal smaller than the domain of all non-spam emails worldwide. For this reason, context-

based classification works best if the knowledge base is derived from the contents of the inbox of an individual or a small group, as opposed to a global distributed system that uses the emails of a large number of unrelated recipients.

2. Technology

2.1 Naïve Bayes Classifier

The Naïve-Bayes Classifier is probably the most popular and most successful text-classification tool used in spam filtering. It has the advantage over rule-based approaches of being a learning algorithm: the more documents it classifies, the more successfully it will classify new documents in the future. It operates under the general premise that a document can be classified based on the words it contains, using the product of the probabilities that each word will be found in documents of a specific class. These probabilities are calculated from data previously received by the classifier, and are relative to the number of previous occurrences of the words in the set of documents belonging to the class. In spam filtering there are only two classes, spam and non-spam, and the Naïve-Bayes Classifier determines the probabilities that words in a document belong to either the spam or the non-spam class. Based on these probabilities, the classifier can return a probability value for the entire document for each class. If the probability value for the spam class is higher than the value for the non-spam class, the document is classified as spam. After a document classification is verified (either by a human after the classification or, in the case of this project, before, when using the pre-classified training set), the data contained in the document are added to the knowledge base of the classifier, in order to improve future classification. The fundamental problem with the Naïve-Bayes probabilistic approach to text classification is that it assumes that each word in a document is independent of the others. This assumption is made for computational purposes, in order to reduce the number of probabilities that need to be computed. It is this Independence Assumption that makes the Naïve-Bayes classifier unsuitable for natural language processing, since it loses the contextual relations between words and fails to address issues such as polysemy and synonymy that introduce errors into text classification.

The Naïve-Bayes classifier attempts to learn a discrete-valued function from a set of documents X onto a set of class values V (in the case of spam filtering, the set {spam, non-spam}):

    f : X \to V

where each x \in X is a sequence of attribute values (in this case, words) \langle a_1, a_2, \ldots, a_n \rangle. The definition of f(x) is as follows [4]:

    f(x) = \arg\max_{v_j \in V} P(x \mid v_j) \, P(v_j) = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j) \, P(v_j)

P(v_j) is calculated as the relative frequency of v_j in the training set. P(a_1, a_2, \ldots, a_n \mid v_j) is calculated by finding the product of the probabilities of finding each attribute in the documents of class v_j. Therefore:

    P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_{i=1}^{n} P(a_i \mid v_j)

    f(x) = \arg\max_{v_j \in V} \left( \prod_{i=1}^{n} P(a_i \mid v_j) \right) P(v_j)

Note that this assumes the independence of each attribute value a_i. In general it would be too expensive computationally to determine the frequencies of a sequence \langle a_1, a_2, \ldots, a_n \rangle in the set of documents belonging to a specific class, and it is for this reason that the independence assumption is used, in order to split up the members of the sequence. Intuitively, P(a_i \mid v_j) may be computed as the ratio of the frequency of a_i in the document subset

[4] Mitchell, 1997

with class v_j. However, this becomes inaccurate as the frequency of a_i becomes very small, the extreme case being a probability of 0 if the frequency of a_i is 0. Instead, each P(a_i \mid v_j) is estimated using the m-estimate of probabilities:

    P(a_i \mid v_j) = \frac{n_c + mp}{n + m}

where:

    n is the total number of words (not distinct) in the document subset with class v_j,
    n_c is the number of occurrences of the attribute a_i in these documents,
    m is a constant sample size, in this case the size of the vocabulary, i.e. all distinct words found in all documents in the training set, and
    p is a prior estimate of the probability, in this case unknown, and assumed to be a uniform 1/m (i.e. a value inversely proportional to the size of the vocabulary).

This gives:

    P(a_i \mid v_j) = \frac{n_c + 1}{n + |Vocabulary|}

Note that the absence of any information regarding the other words (or attributes) in the document is a result of the independence assumption that is required in order to use the Naïve-Bayes method, and is refuted in this project. The complexity of classifying a document is O(vn), where n is the number of words in the new document and v is the number of classes. Assimilating a document is simple, and involves adding new words to the vocabulary and increasing the per-class frequencies of the other words. The complexity of this procedure is also O(vn).
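As an illustration of the formulas above, the following is a minimal sketch of a two-class Naïve Bayes classifier with the m-estimate. It is not the project's actual NaiveBayes class (which uses the Vocabulary and Frequencies structures described in section 3.5.1); the names, the HashMap-based counts and the use of log-probabilities are assumptions for illustration.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal Naive Bayes sketch using the m-estimate formula above.
    public class NaiveBayesSketch {
        private final Map<String, int[]> counts = new HashMap<>(); // word -> per-class frequency
        private final int[] totalWords = new int[2];               // all (non-distinct) words per class
        private final int[] numDocs = new int[2];                  // documents seen per class

        public void learn(List<String> words, int clazz) {
            numDocs[clazz]++;
            for (String w : words) {
                counts.computeIfAbsent(w, k -> new int[2])[clazz]++;
                totalWords[clazz]++;
            }
        }

        // Returns the class (0 = non-spam, 1 = spam) maximising P(v_j) * prod P(a_i | v_j).
        public int classify(List<String> words) {
            double best = Double.NEGATIVE_INFINITY;
            int bestClass = 0;
            int docsSeen = numDocs[0] + numDocs[1];
            for (int j = 0; j < 2; j++) {
                // Work in log space so the long product of small probabilities does not underflow.
                double logProb = Math.log((double) numDocs[j] / docsSeen); // P(v_j)
                for (String w : words) {
                    int[] c = counts.get(w);
                    int nc = (c == null) ? 0 : c[j];
                    // m-estimate: (n_c + 1) / (n + |Vocabulary|)
                    logProb += Math.log((nc + 1.0) / (totalWords[j] + counts.size()));
                }
                if (logProb > best) { best = logProb; bestClass = j; }
            }
            return bestClass;
        }
    }

Working in log space turns the product of small probabilities into a sum; the project itself instead avoids underflow with a scientific-notation class (SciNot, described in section 3.2.3).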

2.2 Latent Semantic Indexing

One of the difficulties in applying a text-classification algorithm to the domain of spam filtering is the high dimensionality of the term-document space, due to the size of the vocabulary present in the documents. Latent Semantic Indexing (LSI) solves this problem and derives the semantic context from the document set at the same time. The algorithm builds a term-document matrix X from the input documents, in which the value of each cell (i, j) of X corresponds to the number of occurrences of term i in document j. Singular Value Decomposition (SVD) is then performed on X in order to extract a set of linearly independent factors that describe the matrix. Certain factors have smaller effects than others and can be ignored, so that what remains is an approximation of the original matrix that includes generalisations over the data but removes slight fluctuations. These generalisations are the latent semantic information required to perform context-based analysis. The issues of polysemy and synonymy are handled to a certain extent by this technique, as the principal functors are no longer words, as in the Naïve Bayes classifier, but a combination of words and semantic data generated by the generalisation. SVD splits a t x d term-document matrix X into the product of three matrices T, S and D, where:

    T is a t x m matrix with orthonormal columns, representing the co-ordinate space for terms,
    S is an m x m diagonal matrix with m values, ordered by size, and
    D is a d x m matrix with orthonormal columns, representing the co-ordinate space for documents.

    X = T S D'

By decreasing the value of m, the sizes of the matrices decrease, reducing the dimensionality of the problem. The resultant matrices T_1, S_1 and D_1 multiply to give X_1, the approximation of the original term-document matrix, complete with the exposed semantic data:

    X_1 = T_1 S_1 D_1' \approx X
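The decomposition and truncation can be expressed directly with the JAMA package that the project itself uses for SVD (see section 3.2.3). The sketch below keeps the k largest factors and also shows the fold-in step D_d = X_d' T_1 S_1^{-1}, which is derived just below; the class and variable names are illustrative assumptions, not the project's LsiMatrix code.

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    // Sketch of LSI dimensionality reduction and fold-in using JAMA.
    public class LsiSketch {
        // Decompose a t x d term-document matrix and keep only the k largest factors.
        // k must not exceed d; JAMA's SVD assumes at least as many rows as columns,
        // which holds here since there are more terms than documents.
        public static Matrix[] reduce(Matrix X, int k) {
            SingularValueDecomposition svd = X.svd();   // X = U * S * V'
            int t = X.getRowDimension();
            int d = X.getColumnDimension();
            // JAMA's getMatrix uses inclusive index ranges.
            Matrix T1 = svd.getU().getMatrix(0, t - 1, 0, k - 1);  // t x k
            Matrix S1 = svd.getS().getMatrix(0, k - 1, 0, k - 1);  // k x k
            Matrix D1 = svd.getV().getMatrix(0, d - 1, 0, k - 1);  // d x k
            return new Matrix[] { T1, S1, D1 };
        }

        // Fold a new document's term vector Xd (t x 1) into the reduced document
        // space: Dd = Xd' * T1 * S1^-1, giving a 1 x k row comparable to rows of D1.
        public static Matrix foldIn(Matrix Xd, Matrix T1, Matrix S1) {
            return Xd.transpose().times(T1).times(S1.inverse());
        }
    }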

Though the X_1 matrix is the same size as X, T_1, S_1 and D_1 are considerably smaller than T, S and D, and therefore take up less space in memory and require fewer computational operations. Two documents can be compared by finding the distance between their document vectors, stored as columns of the X_1 matrix [5]. Instead of using X_1, however, the distance can be calculated using the smaller D_1 and S_1 matrices. Text classification is carried out by finding the nearest neighbours of a query document, and then determining the class of the query document by a poll taken over these neighbours:

    Class(d) = \arg\max_{c \in Classes} |neighbours_c|

The optimum size of the neighbour set varies with the application of the algorithm and can be determined through testing. To classify a new document d, its term vector X_d is computed, and then converted to the co-ordinate space for documents, D_1:

    X_d = T S D_d'
    X_d' = D_d S' T'                            (applying the transpose rule (AB)' = B'A')
    X_d' (T')^{-1} (S')^{-1} = D_d              (applying the inverse rule A A^{-1} = I)
    X_d' (T^{-1})' (S')^{-1} = D_d              (applying the rule (A^{-1})' = (A')^{-1})

But T is orthonormal, so T^{-1} = T', and S is a diagonal matrix, so S' = S. Therefore:

    D_d = X_d' T S^{-1}

With the dimensionality reduced, this gives:

[5] Note: This is not the only method that may be used to determine the similarity between two documents, but it suffices for the purposes of text classification.

    D_d = X_d' T_1 S_1^{-1}

This row D_d is then used in the distance calculations above. The term vector X_d for the document d is computed by finding all words in the document that exist in the term-document matrix X, and building an equivalent column vector of X that represents the frequencies of these words in the new document. For example, if the list of terms before processing document d is <a, b, c, d, e>, and d contains the terms <a, a, c, d, f, g>, the resultant term vector is <2, 0, 1, 1, 0>. Note that the terms in the query document d that did not previously exist in X are not added to this vector, as they would have no effect on the results of the classification. This means that a distance of 0 between the query document and a test document does not necessarily imply that the two documents are identical, but merely that they are effectively identical in terms of the classification: all of the terms in the test document are present in the query document, and all terms in the query document that are not present in the test document are not present in any other test document. To learn a document d, i.e. add its content to the semantic model, its term vector X_d is simply added as a column to the matrix X. If d contains any words not yet present in X, new rows are added to represent these new terms. T, S and D are then computed by SVD as before. A disadvantage of LSI is that the orthonormal basis of X is changed when a new document is added. As a result, the T, S and D matrices must be recomputed after every addition to X. This is a computationally expensive procedure [6] (O(m^2 n + n^3)) that increases the time required to learn new documents one by one. It is therefore more efficient in LSI if a batch of documents is learned at once. As the LSI algorithm, unlike the Naïve-Bayes classifier, does not depend on prior word frequencies, the entire training set can be learned at the same time, without requiring the calculation of intermediate X_1 matrices. For this reason, LSI is more suited to a restricted learning environment in which the training set and the test set are clearly separated, and after the assimilation of the training set the algorithm either assimilates no new documents (X_1 is static) or assimilates

[6] Golub, Van Loan, 1996

documents in batches. Spam filters do not generally work this way, however. Ideally, machine-learning spam filters should not be restricted to an initial training set, but should have the capacity to improve and change their rules over time. Also, it is unreasonable to expect users to keep spam messages on their machines until a sufficient number has been received for them to be assimilated by the spam-filter algorithm as a batch. Contextual Network Graphs, described below, avoid both of these limitations. Another disadvantage is that LSI is patented by Telcordia, and it is primarily for this reason that its performance in this project is being compared with that of the unpatented Contextual Network Graph algorithm.

2.3 Contextual Network Graphs

A new approach to the problem of contextual search and text classification uses Contextual Network Graphs (CNG). In this approach, the term-document matrix of the LSI algorithm is represented as a weighted, bipartite, undirected graph of term and document nodes, in which arcs are defined between two nodes thus:

    \forall a \in Terms, \forall b \in Documents: arc(a, b, w) \iff contains(b, a, w) \wedge w > 0

contains(b, a, w) is interpreted as document b containing w occurrences of term a. arc(a, b, w) is interpreted as there being an arc of weight w between nodes a and b. The graph captures the contextual structure inherent in the data by creating links between keywords and words found in the same context. Semantic information exists as sub-graphs. These sub-graphs contain words that are linked by the fact that they occur in similar situations, defined by paths of varying lengths between two term nodes in the graph. A direct link between two words exists if they are both found in the same document, i.e. for two words w_1, w_2:

    \exists d \in Documents: arc(w_1, d, weight_1) \wedge arc(w_2, d, weight_2)

The path between these two words is of length 3: <w_1, d, w_2>. Other words may be indirectly linked, i.e. the path between them is longer than 3. The principle of CNG is that the shorter the path and the greater the weight values between two terms, the more closely related they are in terms of context. Document classification is carried out by energising a new document node d. This energy percolates through the graph as a function of the weights of the arcs between nodes. When a node receives an amount of energy, it is divided by the number of arcs connected to the node and added to that node's current energy value; then, if it is higher than a certain threshold, it is distributed among the arcs attached to the node in proportion to the weight of each arc.

    Energise(node N, energy E, node Sender) {
        E' = E / degree(N)
        energy(N) = energy(N) + E'
        if (E' > Threshold) {
            for each node N' for which arc(N, N', W) exists {
                if (N' != Sender) {
                    W' = Normalise(N, W)
                    Energise(N', E' * W', N)
                }
            }
        }
    }

where energy(N) is the total amount of energy received by the node and degree(N) is the number of arcs attached to the node. The Normalise procedure normalises the arc weights so that the weights of all arcs attached to a

node sum to 1. This is to ensure that the energy does not increase as it is passed through a node, i.e. the energy received by a node is always greater than or equal to the energy sent from the node to its neighbours. The energy is divided by the number of arcs attached to the node in order to remove the distortion of the results associated with large documents. Documents that contain a large number of words have more arcs attached to them, and therefore have more entry points for energy, and are likely to amass more energy than smaller documents. In practice, non-spam emails are longer on average than spam emails, so this prevents a large number of false negatives in which spam emails are classed as non-spam due to the majority of the energy being collected in the larger non-spam document nodes in the graph. Note: in the extreme case, a node may have only one arc, in which case its weight will be 1 and E' * W' will be equal to E. This means the 'out' energy will be equal to the 'in' energy. If a node has only one arc, either it is the document node energised at the beginning, or the arc is attached to the node that sent the energy to it. If the same amount of energy is sent back, there is the possibility of an infinite loop of energy transfer between the two nodes. One possible way of preventing this is to introduce a constant of decay by, for example, normalising the weights to slightly less than 1, thus ensuring that the energy passed on is always less than that received. However, this may still cause exaggerated results, as loops in nodes with few arcs (documents with few words) will lose energy less rapidly than loops in nodes with many arcs. Instead, a constraint is introduced that prevents feedback, i.e. energy is not sent back from the receiver to the immediate sender node. Note that this does not prevent loops, but any loop must contain at least four nodes (two term nodes and two document nodes), and therefore each node in it will have at least two arcs. These arcs are consequently guaranteed to have weights less than 1, and will therefore cause energy decay at a reasonable rate.
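A minimal sketch of the spreading-activation scheme just described, including the division by degree, the normalisation of arc weights and the no-feedback constraint. The class names, threshold value and class labels are illustrative assumptions, not the project's Node/DocNode/TermNode classes.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of CNG energy propagation and class-average classification.
    public class CngSketch {
        static final double THRESHOLD = 0.0001; // assumed cut-off; tuned by testing in the project

        static class Node {
            final List<Node> neighbours = new ArrayList<>();
            final List<Double> weights = new ArrayList<>(); // parallel to neighbours
            double energy = 0.0;
            int clazz = -1; // class label for document nodes (0 = non-spam, 1 = spam)

            void energise(double e, Node sender) {
                if (neighbours.isEmpty()) return;
                double e1 = e / neighbours.size(); // divide by degree to damp large documents
                energy += e1;
                if (e1 > THRESHOLD) {
                    double total = 0.0;            // normalise arc weights to sum to 1
                    for (double w : weights) total += w;
                    for (int i = 0; i < neighbours.size(); i++) {
                        Node n = neighbours.get(i);
                        if (n != sender) {         // no feedback to the immediate sender
                            n.energise(e1 * weights.get(i) / total, this);
                        }
                    }
                }
            }
        }

        // Class(d) = argmax over classes of the average energy of that class's documents.
        // Assumes both classes are represented in allDocs.
        static int classify(Node queryDoc, List<Node> allDocs) {
            queryDoc.energise(1.0, null);
            double[] sum = new double[2];
            int[] count = new int[2];
            for (Node d : allDocs) {
                sum[d.clazz] += d.energy;
                count[d.clazz]++;
            }
            return sum[0] / count[0] >= sum[1] / count[1] ? 0 : 1;
        }
    }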

[Figure: a contextual network graph of document and term nodes. Energy can travel in both directions in a cycle; energy will not return from a leaf node.]

This algorithm distributes energy relative to the lengths of the paths and the weights of the arcs between two words, as expected. After all the energy has been distributed, the document is classified by finding the maximum of the energy averages for each class:

    Class(d) = \arg\max_{c \in Classes} \; avg_{d_i \in c}(energy(d_i))

Learning a document is much simpler with CNG than with LSI. A new document node is simply added to the graph by creating arcs between it and all nodes that represent terms that the document contains. The weights of the arcs are based on the number of times each term appears in the new document. Note that in CNG, the addition of a new document to the structure does not involve altering the existing graph, other than the creation of new arcs between the new document node and the term nodes immediately linked to it. Unaffected segments of the graph do not need to be accessed. This

is contrasted with the addition of a document in LSI, where the entire term-document matrix must be re-decomposed. This advantage, and the fact that the data structure is a graph, means that the data can be distributed among a number of systems, and the graph can be operated on by several processes at a time.

2.4 Comparison of Algorithms

                             Naïve-Bayes              LSI                               CNG
    Document Classification  Based on statistical     Derived from generalisations     Defined by energy
                             probabilities            based on inherent latent         activation across a graph
                                                      contextual data
    Document Learning        Updating statistics      Recalculation of                 Addition of nodes
                                                      generalisation matrices          to a graph
    Semantic Structure       No                       Yes                              Yes
    Patented                 No                       Yes                              No
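All three classifiers share the same external contract, which the application captures in an abstract Filter class (section 3.2.3): learn a document, classify a document, and save or load the data learned so far. A minimal sketch of such a contract; the method signatures are assumptions for illustration, not the project's actual API.

    import java.io.IOException;
    import java.util.List;

    // Sketch of the shared classifier contract described in section 3.2.3.
    public abstract class Filter {
        // Add a pre-classified document to the knowledge base.
        public abstract void learn(List<String> words, int clazz);

        // Return the predicted class (0 = non-spam, 1 = spam) of a document.
        public abstract int classify(List<String> words);

        // Persist and restore the data learned so far.
        public abstract void save(String path) throws IOException;
        public abstract void load(String path) throws IOException;
    }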

3. Methods

3.1 Source Material

The three algorithms were tested using the SpamAssassin public corpus of spam and non-spam emails. This corpus was chosen because of its relative "difficulty": many of the non-spam documents contained spam-like elements, especially those taken from mailing lists. This helped testing by reducing the likelihood of unhelpful 100% results appearing on smaller tests. The emails contained in this corpus include full message headers, from which the subject header is taken and added to the message body to form the input documents [7]. The documents are then tokenised crudely by using spaces and punctuation marks as word delimiters [8]. The document classes are encoded by use of the file system: spam emails and non-spam emails are stored in separate system directories. The location of each document is then referred to in order to determine its correct class and calculate the accuracy of the classifier.

3.2 Application Structure

3.2.1 Application Description

When the application begins, the corpus directories are scanned and a list is made of the documents to be classified. Documents are then chosen at random from this list. The lifecycle of a document in the program can be split into two stages: document preparation and text classification. During the document preparation stage, the document is parsed, processed and converted to a list of terms for use by the classifier. The classifier then operates on the document according to the algorithm being used and returns a classification. The program compares this classification to the actual class of the document (determined by its directory), and the result of this comparison is reported in a log file. The next document is then selected at random from the list.

[7] HTML tags contained in the documents were not removed during the document preparation stage. However, these may be more suitably dealt with by a rule-based classifier, and removed from a machine-learning classifier like those tested.
[8] An alternative approach would be simply to remove punctuation instead of using it as a delimiter. This is not implemented in the project as the corpus used does not warrant it; however, certain spam-writers now employ a strategy of splitting known common spam words with punctuation marks in order to confuse filters. This is easy to detect with a rule-based classifier, however, and may therefore not continue to be the preferred tactic for spammers in the future. In general, it seems that the ideal delimiting technique for spam documents varies over time, and is not covered in this project.
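A minimal sketch of the crude tokenisation described above, treating runs of whitespace and punctuation as delimiters. The regular expression and the lower-casing are illustrative assumptions, not the project's exact rules.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of crude tokenisation: spaces and punctuation act as word delimiters.
    public class TokeniserSketch {
        public static List<String> tokenise(String subjectAndBody) {
            String[] raw = subjectAndBody.toLowerCase().split("[\\s\\p{Punct}]+");
            List<String> words = new ArrayList<>();
            for (String w : raw) {
                if (!w.isEmpty()) words.add(w); // split can yield a leading empty token
            }
            return words;
        }
    }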

This continues until all documents have been processed. Two operations are performed on the document during the document preparation stage. The first, stop-word removal, removes from the document any words that occur on a list of stop-words, the most common words in a language. The second operation uses a word stemmer to strip words of affixes. Both of these operations are described below.

3.2.2 Application Diagrams

Functional diagram: document chosen -> tokenisation -> stop-word removal -> stemming -> classifiers -> results.
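The pipeline in the functional diagram might be sketched as follows, reusing the tokenise sketch above. The Stemmer interface is a hypothetical wrapper around Porter's algorithm, and the sorted stop-list mirrors the binary-search approach of the Vocabulary class (section 3.2.3); none of these names is the project's actual code.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch of the document-preparation pipeline: tokenise, remove stop-words, stem.
    public class PrepareSketch {
        interface Stemmer { String stem(String word); } // assumed wrapper around Porter's algorithm

        private final List<String> stopList; // sorted list of 571 stop-words (see section 3.3)
        private final Stemmer stemmer;

        public PrepareSketch(List<String> sortedStopList, Stemmer stemmer) {
            this.stopList = sortedStopList;
            this.stemmer = stemmer;
        }

        public List<String> prepare(String subjectAndBody) {
            List<String> out = new ArrayList<>();
            for (String w : TokeniserSketch.tokenise(subjectAndBody)) {
                // Stop-word removal: keep only words absent from the sorted stop-list.
                if (Collections.binarySearch(stopList, w) < 0) {
                    out.add(stemmer.stem(w)); // replace each surviving word by its stem
                }
            }
            return out;
        }
    }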

[UML class structure diagram]

3.2.3 Class Descriptions

Frequencies: A class that contains a list of words and their corresponding frequency values. The word list is implemented using the Vocabulary class described below, which stores the words alphabetically in order to facilitate faster searching. The frequency values are stored as a vector of integers. The location of a word in the vocabulary corresponds to the location of its frequency value in this vector. When a word is added to the vocabulary, its location is returned, and this location is used to determine where the frequency value should be added to the frequency list. Once a word is added to the Frequencies object, it cannot be removed. In this way, the correspondence between the word list and the frequency list is maintained.

FreqsInDoc: A subclass of the Frequencies class that associates a set of word-frequency pairs with a specific document.

Filter: An abstract class that defines the fundamental characteristics shared among all classifiers: the ability to learn and classify documents, and the ability to save and load the data learned so far.

NaiveBayes: The filter class that implements the Naïve Bayes classifier.

Lsi: The filter class that implements the Latent Semantic Indexing classifier.

LsiMatrix: A class that defines a matrix generated from a set of documents and a vocabulary. [9]

CNG: The filter class that implements the Contextual Network Graphs classifier.

Prepare: The class that implements the document pre-processing stage.

Stemmer: Porter's stemming algorithm class (not written by the author).

Node: An abstract class defining a CNG node, including a vector of edges and energising functions.

DocNode: A subclass of the Node class that is specific to document nodes. It contains a constructor that creates a DocNode from a document and adds it to a pre-existing graph by adding edges between it and the relevant term nodes in the

[9] This class extends the Matrix class in the Java Matrix (JAMA) package created by the US National Institute of Standards and Technology. This package is used in order to take advantage of its Singular Value Decomposition function, used in the LSI algorithm.

graph.

TermNode: A subclass of the Node class that is specific to term (word) nodes.

Document: A class that represents a document as a list of words and a classification (derived from the file system, as described in section 3.1).

SciNot: A class representing decimals in scientific notation, used to avoid the overhead of the slower BigDecimal class included in the Java Math library.

KFoldFilter: The main class, which defines the location of the document files, the test mechanism, the filter types to be used for a specific test, and the logging mechanisms.

Vocabulary: A subclass of the Java Vector utility class, adapted to contain a list of unique words. The words are stored in alphabetical order in the vector. This class uses the binary search algorithm to find and add words.

3.3 Stop-word Removal

The stop-list used in this project contains 571 words and is based on Salton's SMART information retrieval system [10]. Stop-words are the set of words in a language judged to carry the least semantic content, and are therefore unlikely to be effective in classifying the documents in which they appear. As a result of their lack of semantic content, they are frequently also the most common words in a language, as they include pronouns, conjunctions, etc. For these reasons they are removed from the documents before learning or classification in order to improve computational performance. Once a document is tokenised, it is represented inside the application as a list of words. The removal of stop-words therefore simply involves checking each word in a document against the stop-list, and removing it if it matches.

3.4 Word Stemming

The other pre-processing technique is word stemming. This reduces the number of distinct terms that need to be processed by the classifier by grouping together words with identical morphological

[10] Salton, 1971

stems. Affixes are removed from words and verbs are de-conjugated as far as possible. The justification for this technique is that words with the same stem will usually have similar semantic content. For example, the words "document" and "documents" do not differ semantically by enough to warrant their treatment as two separate words by the classifier. The stemming algorithm removes the "s" suffix from the latter, giving just one word, "document". Word stemming has two advantages: it reduces computation costs by decreasing the size of the vocabulary that needs to be processed, and it removes the handicap attached to stems (or semantic entities) that produce a large number of variant words (such as verbs), namely that the frequency of the semantic entity would otherwise be distributed among a number of different words, diluting its importance in the classifier, similar to the synonymy problem described above. Word stemmers come in two main varieties: those that contain stem dictionaries, and those that contain affix dictionaries (with corresponding rules governing their use). The algorithm used in this project is Porter's stemming algorithm [11], which employs the latter technique, which, although more likely to make mistakes, is more adaptable to unseen words that do not match a stem contained in a dictionary. Note that for text classification, stemming mistakes do not make much difference as long as the mistakes are consistent. Every word that gets processed by the classifier will have been stemmed, so two identical words will be stemmed, or mis-stemmed, in the same way. For example, the word "late" will be stemmed by Porter's algorithm to "lat", as will every other instance of the word "late", so "lat" will be the classifier's representation of the word "late". Essentially, the stemming provides a homomorphism from the vocabulary onto another, smaller one. Two problems may arise from this stemming technique: two unrelated words may be stemmed to the same token, thus combining their semantic values and distorting the results; and words that act as the stems of other words, and therefore should not be stemmed, may themselves be stemmed, resulting in a mismatch with their variants (e.g. the word "apply" may be stemmed to "app", but the word "applies" would be stemmed to "appl"). The degree to which these problems are important depends on the quality of the stemming

[11] Porter, 1980

algorithm, and Porter's stemmer, used in this project, offers a much greater advantage than disadvantage. In the project, Porter's stemmer is applied after stop-word removal (the stop-list is unstemmed). Each word is sent to the stemmer individually, and is replaced in the document (recall that documents are now stored simply as lists of words) by the output of the stemmer.

3.5 Text Classification

This section describes the implementation of each of the text classifiers used in the project.

3.5.1 Naïve Bayes Classifier

The Naïve Bayes classifier is the simplest of all the classifiers, and is represented as one class. Recall that the Naïve Bayes classifier uses the following formula to classify documents:

    f(x) = \arg\max_{v_j \in V} \left( \prod_{i=1}^{n} P(a_i \mid v_j) \right) P(v_j)

where

    P(a_i \mid v_j) = \frac{n_c + 1}{n + |Vocabulary|}

    n is the total number of words (not distinct) in the document subset with class v_j, and
    n_c is the number of occurrences of the attribute a_i in these documents.

The data required by the classifier are therefore:

    the vocabulary: words previously processed by the classifier;
    the frequency of each of these words for each class;
    the number of documents in each class.

The vocabulary is stored as an instance of the Vocabulary class, described above (vocabulary). The word frequencies are stored as an array of Frequencies objects (frequencies), with each object in the array referring to the frequencies of the words in the vocabulary for a specific class (in this

project, Spam and Non-spam). The words are stored as String objects, and these objects are referenced both by the Vocabulary object and the Frequencies objects, in order to prevent the redundancy that would be incurred if the same word were stored a number of times. The total number of words in all documents of a certain class can be determined from the Frequencies objects by summing all the frequency values in the object (totnumwords). The number of documents in each class is stored as an array of integers (numdocs). The classifier then need only calculate

    \left( \prod_{i=1}^{n} P(a_i \mid v_j) \right) P(v_j)

for every class v_j, using the objects described above, which becomes:

    \left( \prod_{i=1}^{n} \frac{frequencies[j].word(i) + 1}{totnumwords[j] + vocabulary.size} \right) \cdot numdocs[j]

To learn a document, each word is compared with the Vocabulary object. If it does not exist, it is added. The word is then compared with the contents of the member of the frequencies array that corresponds to the class of the document. Any words that do not exist in the Frequencies object are added with corresponding frequency values of 1. If a word is found that already exists in the Frequencies object, its frequency value is incremented. Each word is represented as a String object, as mentioned above, so it is the object reference pointing to the word that is added to the Vocabulary and Frequencies objects. However, when a word w is compared with the contents of the vocabulary and a match w_1 is found, the String objects of w and w_1 will of course be different, as they were generated during the tokenisation of different documents. Furthermore, the word w_1 will already be present in at least one of the Frequencies objects in the frequencies array, as its existence in the vocabulary implies that it was contained in at least one document processed some time in the past. Therefore, to avoid all possibility of duplication of words, it is w_1 that is used for the second step of checking the word against the

frequencies object.

3.5.2 Latent Semantic Indexing

The data structures required for the implementation of the Latent Semantic Indexing algorithm are similar to those required for the Naïve Bayes classifier. Again, a Vocabulary object is used to store a list of terms, and the frequencies of these terms are also stored, but this time the frequencies relate to each document rather than to each document class, so FreqsInDoc objects are used instead of Frequencies objects (see the descriptions above), and each document has a class attribute. One FreqsInDoc object is created for each document. A combination of the contents of the FreqsInDoc objects and the Vocabulary generates the columns of the LSI matrix. Recall that the matrix used in the LSI algorithm is a term-document matrix, so the number of terms determines the number of rows and the number of documents determines the number of columns. The LsiMatrix class is used to generate the term-document matrix. This class is a subclass of the JAMA Matrix class. The value at cell (i, j) of the matrix will be 0 if the word at position i in the vocabulary does not exist in FreqsInDoc number j, or else the frequency of word i in FreqsInDoc number j. Note that after a reasonable number of documents have been processed, most of the cells in any particular column will have values of 0, as there will be more than twice as many words in the vocabulary as in the document corresponding to that column. Therefore, the matrix itself will have a majority of cells with value 0, and it is wasteful to store this matrix in memory. Instead, the FreqsInDoc array and the vocabulary are stored and added to when new documents are processed, and the matrix can be derived at any stage from these structures. Before document classification can take place, the term-document matrix must be processed by Singular Value Decomposition (SVD). It is this process that takes the greatest amount of computation, so it can be delayed until the classification stage if documents are being learned in bulk. The SVD algorithm is contained in the JAMA Matrix class and returns an SVD object that

contains three matrices corresponding to the T, S and D described in section 2.2. After the SVD is completed, new versions of the T, S and D matrices (T_1, S_1 and D_1 in section 2.2) are created by removing a certain amount of data from each (this corresponds to the dimensionality-reduction stage). The amount of data to be retained is determined by a parameter passed to the classifier at object creation, and can be either an absolute number (such as 50, which corresponds to retaining at most fifty factors, or fewer if fifty documents have not yet been processed [12]) or a proportion (such as 0.2, which corresponds to retaining at most one fifth of the factors, rounded down if the number of documents is not divisible by five). The optimum value for this parameter is to be determined by testing. These matrices are then used to calculate a pseudo-document for the document to be classified. The pseudo-document is then compared with each row of D, and the nearest neighbours are found as described in section 2.2. Each row in D corresponds to a FreqsInDoc object, so the classes of the nearest neighbours can be determined by checking the class attribute of the document of the FreqsInDoc object corresponding to each one. The number of nearest neighbours to use is defined by a parameter to the classifier, and the optimum number will be determined by testing. Learning in the LSI algorithm involves simply adding columns and rows to the original term-document matrix (X in section 2.2). The rows and columns of this matrix are derived from the vocabulary object and the FreqsInDoc array that relates to each document. Therefore, to add a new document, a new FreqsInDoc object is generated from the document, and any previously unseen words are added to the vocabulary. A FreqsInDoc object is created by generating a list of unique terms from the document and combining them with their corresponding frequency values for the document. These two lists, along with the document object itself (an instance of the Document class), form the FreqsInDoc object, which is then added to the array. As stated above, the term-document matrix must be decomposed using SVD before classification is carried out. This is an intermediary step which belongs to the learning or classification steps, as it

[12] The maximum number of factors is in fact the rank of the S matrix, which is the minimum of t and d, the number of terms and documents processed so far. However, as the number of terms will be larger than the number of documents (up to a reasonable number of documents), this is equivalent to the number of documents processed.
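As a rough illustration of deriving the matrix on demand from the sparse structures, rather than keeping the mostly-zero matrix in memory: the sketch below uses plain word-to-frequency maps in place of the project's FreqsInDoc objects, and all names are illustrative assumptions.

    import Jama.Matrix;
    import java.util.List;
    import java.util.Map;

    // Sketch of building the dense t x d term-document matrix on demand.
    public class MatrixBuilderSketch {
        // vocabulary: sorted list of all distinct terms;
        // docs: one word -> frequency map per document (stand-in for FreqsInDoc).
        public static Matrix build(List<String> vocabulary, List<Map<String, Integer>> docs) {
            double[][] cells = new double[vocabulary.size()][docs.size()]; // zero-filled by default
            for (int j = 0; j < docs.size(); j++) {
                Map<String, Integer> doc = docs.get(j);
                for (int i = 0; i < vocabulary.size(); i++) {
                    Integer f = doc.get(vocabulary.get(i));
                    if (f != null) cells[i][j] = f; // cell (i, j) = frequency of term i in document j
                }
            }
            return new Matrix(cells);
        }
    }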


More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Latent Semantic Indexing

Latent Semantic Indexing Latent Semantic Indexing Thanks to Ian Soboroff Information Retrieval 1 Issues: Vector Space Model Assumes terms are independent Some terms are likely to appear together synonyms, related words spelling

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

DEGENERACY AND THE FUNDAMENTAL THEOREM

DEGENERACY AND THE FUNDAMENTAL THEOREM DEGENERACY AND THE FUNDAMENTAL THEOREM The Standard Simplex Method in Matrix Notation: we start with the standard form of the linear program in matrix notation: (SLP) m n we assume (SLP) is feasible, and

More information

The Semantic Conference Organizer

The Semantic Conference Organizer 34 The Semantic Conference Organizer Kevin Heinrich, Michael W. Berry, Jack J. Dongarra, Sathish Vadhiyar University of Tennessee, Knoxville, USA CONTENTS 34.1 Background... 571 34.2 Latent Semantic Indexing...

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems

More information

Centrality Book. cohesion.

Centrality Book. cohesion. Cohesion The graph-theoretic terms discussed in the previous chapter have very specific and concrete meanings which are highly shared across the field of graph theory and other fields like social network

More information

Interactive Math Glossary Terms and Definitions

Interactive Math Glossary Terms and Definitions Terms and Definitions Absolute Value the magnitude of a number, or the distance from 0 on a real number line Addend any number or quantity being added addend + addend = sum Additive Property of Area the

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution Summary Each of the ham/spam classifiers has been tested against random samples from pre- processed enron sets 1 through 6 obtained via: http://www.aueb.gr/users/ion/data/enron- spam/, or the entire set

More information

Retrieval by Content. Part 3: Text Retrieval Latent Semantic Indexing. Srihari: CSE 626 1

Retrieval by Content. Part 3: Text Retrieval Latent Semantic Indexing. Srihari: CSE 626 1 Retrieval by Content art 3: Text Retrieval Latent Semantic Indexing Srihari: CSE 626 1 Latent Semantic Indexing LSI isadvantage of exclusive use of representing a document as a T-dimensional vector of

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Exact Algorithms Lecture 7: FPT Hardness and the ETH

Exact Algorithms Lecture 7: FPT Hardness and the ETH Exact Algorithms Lecture 7: FPT Hardness and the ETH February 12, 2016 Lecturer: Michael Lampis 1 Reminder: FPT algorithms Definition 1. A parameterized problem is a function from (χ, k) {0, 1} N to {0,

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

Query Refinement and Search Result Presentation

Query Refinement and Search Result Presentation Query Refinement and Search Result Presentation (Short) Queries & Information Needs A query can be a poor representation of the information need Short queries are often used in search engines due to the

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

Uncertain Data Models

Uncertain Data Models Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data

More information

Towards Understanding Latent Semantic Indexing. Second Reader: Dr. Mario Nascimento

Towards Understanding Latent Semantic Indexing. Second Reader: Dr. Mario Nascimento Towards Understanding Latent Semantic Indexing Bin Cheng Supervisor: Dr. Eleni Stroulia Second Reader: Dr. Mario Nascimento 0 TABLE OF CONTENTS ABSTRACT...3 1 INTRODUCTION...4 2 RELATED WORKS...6 2.1 TRADITIONAL

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

CORE for Anti-Spam. - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow

CORE for Anti-Spam. - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow CORE for Anti-Spam - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow Contents 1 Spam Defense An Overview... 3 1.1 Efficient Spam Protection Procedure...

More information

Digital Libraries: Language Technologies

Digital Libraries: Language Technologies Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Probabilistic Learning Classification using Naïve Bayes

Probabilistic Learning Classification using Naïve Bayes Probabilistic Learning Classification using Naïve Bayes Weather forecasts are usually provided in terms such as 70 percent chance of rain. These forecasts are known as probabilities of precipitation reports.

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Readings in unsupervised Learning Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Hashing 2 Hashing: definition Hashing is the process of converting

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Chapter 5 VARIABLE-LENGTH CODING Information Theory Results (II)

Chapter 5 VARIABLE-LENGTH CODING Information Theory Results (II) Chapter 5 VARIABLE-LENGTH CODING ---- Information Theory Results (II) 1 Some Fundamental Results Coding an Information Source Consider an information source, represented by a source alphabet S. S = { s,

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016 CPSC 340: Machine Learning and Data Mining Non-Parametric Models Fall 2016 Admin Course add/drop deadline tomorrow. Assignment 1 is due Friday. Setup your CS undergrad account ASAP to use Handin: https://www.cs.ubc.ca/getacct

More information

Impact of Term Weighting Schemes on Document Clustering A Review

Impact of Term Weighting Schemes on Document Clustering A Review Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Supervised Learning Classification Algorithms Comparison

Supervised Learning Classification Algorithms Comparison Supervised Learning Classification Algorithms Comparison Aditya Singh Rathore B.Tech, J.K. Lakshmipat University -------------------------------------------------------------***---------------------------------------------------------

More information

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN , An Integrated Neural IR System. Victoria J. Hodge Dept. of Computer Science, University ofyork, UK vicky@cs.york.ac.uk Jim Austin Dept. of Computer Science, University ofyork, UK austin@cs.york.ac.uk Abstract.

More information

ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg

ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg CfEM You should be able to use this evaluation booklet to help chart your progress in the Maths department from August in S1 until

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

GoldSim: Using Simulation to Move Beyond the Limitations of Spreadsheet Models

GoldSim: Using Simulation to Move Beyond the Limitations of Spreadsheet Models GoldSim: Using Simulation to Move Beyond the Limitations of Spreadsheet Models White Paper Abstract While spreadsheets are appropriate for many types of applications, due to a number of inherent limitations

More information

Project Report: "Bayesian Spam Filter"

Project Report: Bayesian  Spam Filter Humboldt-Universität zu Berlin Lehrstuhl für Maschinelles Lernen Sommersemester 2016 Maschinelles Lernen 1 Project Report: "Bayesian E-Mail Spam Filter" The Bayesians Sabine Bertram, Carolina Gumuljo,

More information

Mathematics Scope & Sequence Grade 4

Mathematics Scope & Sequence Grade 4 Mathematics Scope & Sequence Grade 4 Revised: May 24, 2016 First Nine Weeks (39 days) Whole Numbers Place Value 4.2B represent the value of the digit in whole numbers through 1,000,000,000 and decimals

More information

06: Logistic Regression

06: Logistic Regression 06_Logistic_Regression 06: Logistic Regression Previous Next Index Classification Where y is a discrete value Develop the logistic regression algorithm to determine what class a new input should fall into

More information

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Weighted Powers Ranking Method

Weighted Powers Ranking Method Weighted Powers Ranking Method Introduction The Weighted Powers Ranking Method is a method for ranking sports teams utilizing both number of teams, and strength of the schedule (i.e. how good are the teams

More information

MRT based Fixed Block size Transform Coding

MRT based Fixed Block size Transform Coding 3 MRT based Fixed Block size Transform Coding Contents 3.1 Transform Coding..64 3.1.1 Transform Selection...65 3.1.2 Sub-image size selection... 66 3.1.3 Bit Allocation.....67 3.2 Transform coding using

More information

Predict the box office of US movies

Predict the box office of US movies Predict the box office of US movies Group members: Hanqing Ma, Jin Sun, Zeyu Zhang 1. Introduction Our task is to predict the box office of the upcoming movies using the properties of the movies, such

More information

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, 00142 Roma, Italy e-mail: pimassol@istat.it 1. Introduction Questions can be usually asked following specific

More information

Geographic Information Fundamentals Overview

Geographic Information Fundamentals Overview CEN TC 287 Date: 1998-07 CR 287002:1998 CEN TC 287 Secretariat: AFNOR Geographic Information Fundamentals Overview Geoinformation Übersicht Information géographique Vue d'ensemble ICS: Descriptors: Document

More information

Predicting Popular Xbox games based on Search Queries of Users

Predicting Popular Xbox games based on Search Queries of Users 1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information