Spam Filtering using Contextual Network Graphs


Abstract

This document describes a machine-learning solution to the spam-filtering problem. Spam filtering is treated as a text-classification problem in a very high-dimensional space. Two text-classification algorithms, Latent Semantic Indexing (LSI) and Contextual Network Graphs (CNG), are compared to existing Bayesian techniques by monitoring their ability to process and correctly classify a series of spam and non-spam documents. The LSI and CNG algorithms have an advantage over the Naïve Bayes classifier in the domain of natural language processing because they include a representation of context, i.e. relations between terms. Both LSI and CNG take advantage of these relations to offer a conceptual or semantic-based search, which is adapted in this paper to the domain of spam filtering.

Contents

1. Introduction
   1.1 Aims
   1.2 Background
       1.2.1 Spam
       1.2.2 Spam Filtering
       1.2.3 Text Classification
       1.2.4 Spam Filtering as a Context-Heavy Text-Classification Problem
2. Technology
   2.1 Naïve Bayes Classifier
   2.2 Latent Semantic Indexing
   2.3 Contextual Network Graphs
   2.4 Comparison of Algorithms
3. Methods
   3.1 Source Material
   3.2 Application Structure
       3.2.1 Application Description
       3.2.2 Application Diagrams
       3.2.3 Class Descriptions
   3.3 Stop-word Removal
   3.4 Word Stemming
   3.5 Text Classification
   3.6 Measuring Success
   3.7 Storing and Retrieving Data
   3.8 Expected Results
4. Results
   4.1 Data
   4.2 Discussion
5. Conclusion
   5.1 Recommendations
6. Bibliography

1. Introduction

1.1 Aims

The aims of this project are to develop and test an application that applies three machine-learning text-classification algorithms, the Naïve-Bayes Classifier, Latent Semantic Indexing [1] and Contextual Network Graphs [2], to the field of spam filtering. A number of pre-classified emails are processed by the algorithms, and the results of each algorithm's classification of the emails are compared in order to determine which algorithm is the most successful and what conditions are required for each algorithm to succeed. Success will be measured primarily by the algorithms' ability to correctly identify spam and non-spam emails, as a function of time, or of the number of emails processed so far. Another measure of success will be the time taken to process a single email (both classification and assimilation of the new information it contains). It is expected that, for some algorithms, the amount of computation required to process an email will increase as the amount of data increases. The amount of memory used by each algorithm for data storage will also be included in the results. Testing the conditions required for the algorithms to succeed will include testing by number of emails (corpus size) and by various other values, such as the parameters used in the algorithms, in order to determine their ideal values. Other factors that will be taken into account are the abilities of the various algorithms and their data structures to be distributed and run in parallel on different systems. The expected result is that LSI and CNG outperform the Naïve-Bayes Classifier in classification accuracy after a certain number of emails have been processed. The primary purpose of the project is to compare the new CNG technology to the patented LSI approach in order to establish whether CNG is a viable alternative to LSI in context-heavy domains such as spam filtering.

[1] Deerwester et al., 1990
[2] Ceglowski et al., 2003

1.2 Background

1.2.1 Spam

The reception of overwhelming amounts of unsolicited and unwanted emails (spam) is a problem experienced by almost every email user worldwide. By 2004 the epidemic had become so widespread that in December 2003 BBC News estimated that 40% of all emails sent are identified as spam, and that identifying and deleting spam costs UK businesses one hour per worker per day. A number of strategies have been tried or proposed to curb this problem. Currently, the proposed strategies can be classified into two different types: prevention techniques and cure techniques. The success of spam is principally due to the fact that each message is sent in huge numbers, i.e. to a very large number of people, in the hope that a tiny proportion of those that receive the message buy the advertised product. The Wall Street Journal estimates that a response rate of 0.0001% is enough for the sender of the spam to turn a profit. Prevention techniques generally aim to stop spam from being sent in the first place, by implementing a number of controls and checks on the global email system that would make it more difficult for spammers to send emails in such large numbers. At the World Economic Forum in January 2004, Bill Gates described his suggestions for spam prevention, including forcing each computer sending an email to perform a simple calculation. This would not affect users that send personal emails only, but would be considerably more expensive for users that send bulk emails to large numbers of recipients. Another suggestion Gates made was "payment at risk", which would force senders to pay a charge each time one of their emails was dismissed as spam. The main problem with prevention techniques is that they require the establishment of a set of controls on the email system that would have to be run by corporations. In the case of the "payment at risk" option, a huge central clearing office would be required to process the payments. These requirements mean that a corporation or set of corporations could theoretically gain control over the worldwide email network. In general, the principle of the freedom of the internet would oppose this solution. Other common attempts to solve the problem of spam can be considered cure techniques. The objective of these techniques is to stop spam messages from entering the inbox of their intended recipients after they have been sent. These can be implemented in the form of server- or client-side

applications that filter out unsolicited emails using a number of rules that are either predefined or generated by a learning algorithm. The most common filter applications use keyword lists and ranges of blocked IP addresses to stop potential spam before it arrives in an inbox. So far, however, none of the proposed solutions has had much success, due to the complexity of the problem, both in terms of the ethics of harnessing and restricting a free system like the internet, and in terms of the logistics of keyword-based filtering. The prevention techniques fall outside the scope of this document; this project is concerned purely with the advancement of filtering ("cure") techniques.

1.2.2 Spam Filtering

Current spam filters come in a number of different forms [3], the most successful being rule-based classifiers such as the original SpamAssassin, and statistical classifiers using Bayesian probability techniques (SpamBayes, later versions of SpamAssassin, etc.). Rule-based classifiers usually contain an extensive number of tests, each one associated with a score, that are carried out on an email. If the email fails a test, its score is increased. After all the tests have been applied, if the email's score is above a certain threshold, it is classified as spam and discarded. Rule-based classifiers have the advantage of being able to employ diverse and specific rules to catch potential spam, such as checking the size of an email or the number of pictures it contains. However, this technique is not machine learning in the general sense, and rules therefore have to be entered and maintained by hand. This is a considerable disadvantage, as learning algorithms such as Bayesian classifiers can derive rules as they receive more information, hence including rules that are too subtle or too complicated to be entered by hand. The Bayesian technique is described in section 2.1.

1.2.3 Text Classification

In mathematical terms, text classification is the partitioning of a set of documents into a number of equivalence classes. Each equivalence class identifies the set of documents that belong to a document type. In the case of spam filtering, for example, there are two document types: spam emails

[3] Mertz, 2002

and non-spam emails. The job of spam filters is to create a partition of a document set containing the emails received in a user's inbox. Currently, the most widely-used text-classification tool is the Naïve-Bayes classifier, which uses probabilistic reasoning to classify documents. It is described in detail in section 2.1.

1.2.4 Spam Filtering as a Context-Heavy Text-Classification Problem

Emails, in general, are documents designed for human communication and therefore use natural language, inheriting all the advantages and disadvantages that it possesses. An advantage of natural language from a text-classification point of view is that the semantic structure of natural language gives rise to the presence of keywords: words that contain a large proportion of the semantic meaning of a sentence. These keywords can then be used for searching for or classifying documents. Keywords can be identified in a number of different ways. In rule-based spam filters, they are given by a list of predefined rules that can be changed by hand. For example, a rule may exist that states that the presence of the word "Vicodin" in an email is an indicator that the message may be spam, or that messages that do not include a reply-to address are likely to be spam. In machine-learning text classifiers, the keywords are derived from the analysis of the document set. Annotated documents (the training set) are analysed, and the results of this analysis can be used in the classification of new documents. For example, if the word "Vicodin" appears more frequently in spam documents in the training set than in non-spam documents, it may be reasonable to assume that "Vicodin" is a keyword that is commonly used in spam messages, and its presence in new documents would be an indicator of a spam email. On the other hand, a word with little semantic import, such as "and", may have been found in relatively equal measure in both document classes and will therefore have little effect on the classification of new documents. Using these techniques, keyword analysis can be a strong tool for text classification. However, natural language, being organic and evolving, is prone to phenomena such as polysemy and synonymy, which weaken the strength of keywords by introducing non-one-to-one relations

between words and meaning. A word can have several meanings, and the same semantic concept can be represented by several different words. Polysemy and synonymy can have detrimental effects on the accuracy of pure keyword-based classifiers. Polysemy can confuse a classifier as it allows the same word to be used in more than one context. For example, the word "play" is highly polysemous and can occur in a number of contexts. This means that "play" may not be a strong indicator of a specific class, as it may be present in different classes in different contexts. Polysemy reduces the strength of keywords by increasing the number of classes in which a polysemous word can appear. If two words are synonymous, i.e. they have identical meanings, it is probable that they will be used interchangeably in the same document class. This would, in turn, often result in a reduction of the frequency of each word. If the two words were replaced by a single word, thus removing the synonymy, this word would have a higher frequency than either of the two words, making it a more powerful keyword. Synonymy reduces the strength of keywords by spreading their value among a number of synonyms. A more powerful approach to the classification of natural language texts is to use context-based searches that take into account the semantic links between words and search over a semantic space rather than simply a list of keywords. This helps to eliminate problems with polysemy and synonymy, as the search or classification is based on semantic data instead of keywords. In terms of text classification, different areas of the semantic space will belong to different classes. When a new document is classified, it is placed into the semantic space, and thus its class can be determined. The subsection of the semantic space that contains documents belonging to a specific class is the class space for that class. The amount of overlap between class spaces in the semantic space is an important factor in the level of success of a context-based classification. Classes with highly specified and compact domains, such as spam emails, are easier to use in text classification than those that have fewer restrictions, such as non-spam emails. Spam emails are usually quite homogeneous in their content; they have a limited subject domain. Non-spam emails can relate to any subject, and are therefore harder to classify, but the domain of a single individual's emails will be a great deal smaller than the domain of all non-spam emails worldwide. For this reason, context-

based classification works best if the knowledge base is derived from the contents of the inbox of an individual or a small group, as opposed to a global distributed system that uses the emails of a large number of unrelated recipients.

2. Technology

2.1 Naïve Bayes Classifier

The Naïve-Bayes Classifier is probably the most popular and most successful text-classification tool used in spam filtering. It has the advantage over rule-based approaches of being a learning algorithm: the more documents it classifies, the more successfully it will classify new documents in the future. It operates under the general premise that a document can be classified based on the words it contains, using the product of the probabilities that each word will be found in documents of a specific class. These probabilities are calculated from data previously received by the classifier, and are relative to the number of previous occurrences of the words in the set of documents belonging to the class. In spam filtering there are only two classes, spam and non-spam, and the Naïve-Bayes Classifier determines the probabilities that words in a document belong to either the spam or the non-spam class. Based on these probabilities, the classifier can return a probability value for the entire document for each class. If the probability value for the spam class is higher than the value for the non-spam class, the document is classified as spam. After a document classification is verified (either by a human after the classification or, in the case of this project, before, when using the pre-classified training set), the data contained in the document are added to the knowledge base of the classifier, in order to improve future classification. The fundamental problem with the Naïve-Bayes probabilistic approach to text classification is that it assumes that each word in a document is independent of the others. This assumption is made for computational purposes, in order to reduce the number of probabilities that need to be computed. It is this Independence Assumption that makes the Naïve-Bayes classifier unsuitable for natural language processing, since it loses the contextual relations between words and fails to address issues such as polysemy and synonymy that introduce errors into text classification.

The Naïve-Bayes classifier attempts to learn a discrete-valued function from a set of documents X onto a set of class values V (in the case of spam filtering, the set {spam, non-spam}):

    f : X \to V

where each x \in X is a sequence of attribute values (in this case, words) \langle a_1, a_2, \ldots, a_n \rangle. The definition of f(x) is as follows [4]:

    f(x) = \arg\max_{v_j \in V} P(x \mid v_j) \, P(v_j) = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j) \, P(v_j)

P(v_j) is calculated as the relative frequency of v_j in the training set. P(a_1, a_2, \ldots, a_n \mid v_j) is calculated by finding the product of the probabilities of finding each attribute in the documents of class v_j. Therefore:

    P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_{i=1}^{n} P(a_i \mid v_j)

    f(x) = \arg\max_{v_j \in V} \left( \prod_{i=1}^{n} P(a_i \mid v_j) \right) P(v_j)

Note that this assumes the independence of each attribute value a_i. In general it would be too expensive computationally to determine the frequencies of a sequence \langle a_1, a_2, \ldots, a_n \rangle in the set of documents belonging to a specific class, and it is for this reason that the independence assumption is used, in order to split up the members of the sequence. Intuitively, P(a_i \mid v_j) may be computed as the ratio of the frequency of a_i in the document subset

[4] Mitchell, 1997

with class v_j. However, this becomes inaccurate as the frequency of a_i becomes very small, the extreme case being a probability of 0 if the frequency of a_i is 0. Instead, each P(a_i \mid v_j) is estimated using the m-estimate of probabilities:

    P(a_i \mid v_j) = \frac{n_c + mp}{n + m}

where:

    n is the total number of words (not distinct) in the document subset with class v_j,
    n_c is the number of occurrences of the attribute a_i in these documents,
    m is a constant sample size, in this case the size of the vocabulary, i.e. all distinct words found in all documents in the training set, and
    p is a prior estimate of the probability, in this case unknown, and assumed to be a uniform 1/m (i.e. a value inversely proportional to the size of the vocabulary).

This gives:

    P(a_i \mid v_j) = \frac{n_c + 1}{n + |Vocabulary|}

Note that the absence of any information regarding the other words (or attributes) in the document is a result of the independence assumption that is required in order to use the Naïve-Bayes method, and is refuted in this project. The complexity of classifying a document is O(vn), where n is the number of words in the new document and v is the number of classes. Assimilating a document is simple, and involves adding new words to the vocabulary and increasing the per-class frequencies of the other words. The complexity of this procedure is also O(vn).
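As an illustration of the formulas above, the following is a minimal sketch of a two-class Naïve Bayes classifier with the m-estimate. It is not the project's actual NaiveBayes class (which uses the Vocabulary and Frequencies structures described in section 3.5.1); the names, the HashMap-based counts and the use of log-probabilities are assumptions for illustration.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal Naive Bayes sketch using the m-estimate formula above.
    public class NaiveBayesSketch {
        private final Map<String, int[]> counts = new HashMap<>(); // word -> per-class frequency
        private final int[] totalWords = new int[2];               // all (non-distinct) words per class
        private final int[] numDocs = new int[2];                  // documents seen per class

        public void learn(List<String> words, int clazz) {
            numDocs[clazz]++;
            for (String w : words) {
                counts.computeIfAbsent(w, k -> new int[2])[clazz]++;
                totalWords[clazz]++;
            }
        }

        // Returns the class (0 = non-spam, 1 = spam) maximising P(v_j) * prod P(a_i | v_j).
        public int classify(List<String> words) {
            double best = Double.NEGATIVE_INFINITY;
            int bestClass = 0;
            int docsSeen = numDocs[0] + numDocs[1];
            for (int j = 0; j < 2; j++) {
                // Work in log space so the long product of small probabilities does not underflow.
                double logProb = Math.log((double) numDocs[j] / docsSeen); // P(v_j)
                for (String w : words) {
                    int[] c = counts.get(w);
                    int nc = (c == null) ? 0 : c[j];
                    // m-estimate: (n_c + 1) / (n + |Vocabulary|)
                    logProb += Math.log((nc + 1.0) / (totalWords[j] + counts.size()));
                }
                if (logProb > best) { best = logProb; bestClass = j; }
            }
            return bestClass;
        }
    }

Working in log space turns the product of small probabilities into a sum; the project itself instead avoids underflow with a scientific-notation class (SciNot, described in section 3.2.3).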

2.2 Latent Semantic Indexing

One of the difficulties in applying a text-classification algorithm to the domain of spam filtering is the high dimensionality of the term-document space, due to the size of the vocabulary present in the documents. Latent Semantic Indexing (LSI) solves this problem and derives the semantic context from the document set at the same time. The algorithm builds a term-document matrix X from the input documents, in which the value of each cell (i, j) of X corresponds to the number of occurrences of term i in document j. Singular Value Decomposition (SVD) is then performed on X in order to extract a set of linearly independent factors that describe the matrix. Certain factors have smaller effects than others and can be ignored, so that what remains is an approximation of the original matrix that includes generalisations over the data but removes slight fluctuations. These generalisations are the latent semantic information required to perform context-based analysis. The issues of polysemy and synonymy are handled to a certain extent by this technique, as the principal functors are no longer words, as in the Naïve Bayes classifier, but a combination of words and semantic data generated by the generalisation. SVD splits a t x d term-document matrix X into the product of three matrices T, S and D, where:

    T is a t x m matrix with orthonormal columns, representing the co-ordinate space for terms,
    S is an m x m diagonal matrix with m values, ordered by size, and
    D is a d x m matrix with orthonormal columns, representing the co-ordinate space for documents.

    X = T S D'

By decreasing the value of m, the sizes of the matrices decrease, reducing the dimensionality of the problem. The resultant matrices T_1, S_1 and D_1 multiply to give X_1, the approximation of the original term-document matrix, complete with the exposed semantic data:

    X_1 = T_1 S_1 D_1' \approx X
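The decomposition and truncation can be expressed directly with the JAMA package that the project itself uses for SVD (see section 3.2.3). The sketch below keeps the k largest factors and also shows the fold-in step D_d = X_d' T_1 S_1^{-1}, which is derived just below; the class and variable names are illustrative assumptions, not the project's LsiMatrix code.

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    // Sketch of LSI dimensionality reduction and fold-in using JAMA.
    public class LsiSketch {
        // Decompose a t x d term-document matrix and keep only the k largest factors.
        // k must not exceed d; JAMA's SVD assumes at least as many rows as columns,
        // which holds here since there are more terms than documents.
        public static Matrix[] reduce(Matrix X, int k) {
            SingularValueDecomposition svd = X.svd();   // X = U * S * V'
            int t = X.getRowDimension();
            int d = X.getColumnDimension();
            // JAMA's getMatrix uses inclusive index ranges.
            Matrix T1 = svd.getU().getMatrix(0, t - 1, 0, k - 1);  // t x k
            Matrix S1 = svd.getS().getMatrix(0, k - 1, 0, k - 1);  // k x k
            Matrix D1 = svd.getV().getMatrix(0, d - 1, 0, k - 1);  // d x k
            return new Matrix[] { T1, S1, D1 };
        }

        // Fold a new document's term vector Xd (t x 1) into the reduced document
        // space: Dd = Xd' * T1 * S1^-1, giving a 1 x k row comparable to rows of D1.
        public static Matrix foldIn(Matrix Xd, Matrix T1, Matrix S1) {
            return Xd.transpose().times(T1).times(S1.inverse());
        }
    }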

Though the X_1 matrix is the same size as X, T_1, S_1 and D_1 are considerably smaller than T, S and D, and therefore take up less space in memory and require fewer computational operations. Two documents can be compared by finding the distance between their document vectors, stored as columns of the X_1 matrix [5]. Instead of using X_1, however, the distance can be calculated using the smaller D_1 and S_1 matrices. Text classification is carried out by finding the nearest neighbours of a query document, and then determining the class of the query document by a poll taken over these neighbours:

    Class(d) = \arg\max_{c \in Classes} |neighbours_c|

The optimum size of the neighbour set varies with the application of the algorithm and can be determined through testing. To classify a new document d, its term vector X_d is computed, and then converted to the co-ordinate space for documents, D_1:

    X_d = T S D_d'
    X_d' = D_d S' T'                            (applying the transpose rule (AB)' = B'A')
    X_d' (T')^{-1} (S')^{-1} = D_d              (applying the inverse rule A A^{-1} = I)
    X_d' (T^{-1})' (S')^{-1} = D_d              (applying the rule (A^{-1})' = (A')^{-1})

But T is orthonormal, so T^{-1} = T', and S is a diagonal matrix, so S' = S. Therefore:

    D_d = X_d' T S^{-1}

With the dimensionality reduced, this gives:

[5] Note: This is not the only method that may be used to determine the similarity between two documents, but it suffices for the purposes of text classification.

    D_d = X_d' T_1 S_1^{-1}

This row D_d is then used in the distance calculations above. The term vector X_d for the document d is computed by finding all words in the document that exist in the term-document matrix X, and building an equivalent column vector of X that represents the frequencies of these words in the new document. For example, if the list of terms before processing document d is <a, b, c, d, e>, and d contains the terms <a, a, c, d, f, g>, the resultant term vector is <2, 0, 1, 1, 0>. Note that the terms in the query document d that did not previously exist in X are not added to this vector, as they would have no effect on the results of the classification. This means that a distance of 0 between the query document and a test document does not necessarily imply that the two documents are identical, but merely that they are effectively identical in terms of the classification: all of the terms in the test document are present in the query document, and all terms in the query document that are not present in the test document are not present in any other test document. To learn a document d, i.e. add its content to the semantic model, its term vector X_d is simply added as a column to the matrix X. If d contains any words not yet present in X, new rows are added to represent these new terms. T, S and D are then computed by SVD as before. A disadvantage of LSI is that the orthonormal basis of X is changed when a new document is added. As a result, the T, S and D matrices must be recomputed after every addition to X. This is a computationally expensive procedure [6] (O(m^2 n + n^3)) that increases the time required to learn new documents one by one. It is therefore more efficient in LSI if a batch of documents is learned at once. As the LSI algorithm, unlike the Naïve-Bayes classifier, does not depend on prior word frequencies, the entire training set can be learned at the same time, without requiring the calculation of intermediate X_1 matrices. For this reason, LSI is more suited to a restricted learning environment in which the training set and the test set are clearly separated, and after the assimilation of the training set the algorithm either assimilates no new documents (X_1 is static) or assimilates

[6] Golub, Van Loan, 1996

documents in batches. Spam filters do not generally work this way, however. Ideally, machine-learning spam filters should not be restricted to an initial training set, but should have the capacity to improve and change their rules over time. Also, it is unreasonable to expect users to keep spam messages on their machines until a sufficient number has been received for them to be assimilated by the spam-filter algorithm as a batch. Contextual Network Graphs, described below, avoid both of these limitations. Another disadvantage is that LSI is patented by Telcordia, and it is primarily for this reason that its performance in this project is being compared with that of the unpatented Contextual Network Graph algorithm.

2.3 Contextual Network Graphs

A new approach to the problem of contextual search and text classification uses Contextual Network Graphs (CNG). In this approach, the term-document matrix of the LSI algorithm is represented as a weighted, bipartite, undirected graph of term and document nodes, in which arcs are defined between two nodes thus:

    \forall a \in Terms, \forall b \in Documents: arc(a, b, w) \iff contains(b, a, w) \wedge w > 0

contains(b, a, w) is interpreted as document b containing w occurrences of term a. arc(a, b, w) is interpreted as there being an arc of weight w between nodes a and b. The graph captures the contextual structure inherent in the data by creating links between keywords and words found in the same context. Semantic information exists as sub-graphs. These sub-graphs contain words that are linked by the fact that they occur in similar situations, defined by paths of varying lengths between two term nodes in the graph. A direct link between two words exists if they are both found in the same document, i.e. for two words w_1, w_2:

    \exists d \in Documents: arc(w_1, d, weight_1) \wedge arc(w_2, d, weight_2)

The path between these two words is of length 3: <w_1, d, w_2>. Other words may be indirectly linked, i.e. the path between them is longer than 3. The principle of CNG is that the shorter the path and the greater the weight values between two terms, the more closely related they are in terms of context. Document classification is carried out by energising a new document node d. This energy percolates through the graph as a function of the weights of the arcs between nodes. When a node receives an amount of energy, it is divided by the number of arcs connected to the node and added to that node's current energy value; then, if it is higher than a certain threshold, it is distributed among the arcs attached to the node in proportion to the weight of each arc.

    Energise(node N, energy E, node Sender) {
        E' = E / degree(N)
        energy(N) = energy(N) + E'
        if (E' > Threshold) {
            for each node N' for which arc(N, N', W) exists {
                if (N' != Sender) {
                    W' = Normalise(N, W)
                    Energise(N', E' * W', N)
                }
            }
        }
    }

where energy(N) is the total amount of energy received by the node and degree(N) is the number of arcs attached to the node. The Normalise procedure normalises the arc weights so that the weights of all arcs attached to a

node sum to 1. This is to ensure that the energy does not increase as it is passed through a node, i.e. the energy received by a node is always greater than or equal to the energy sent from the node to its neighbours. The energy is divided by the number of arcs attached to the node in order to remove the distortion of the results associated with large documents. Documents that contain a large number of words have more arcs attached to them, and therefore have more entry points for energy, and are likely to amass more energy than smaller documents. In practice, non-spam emails are longer on average than spam emails, so this prevents a large number of false negatives in which spam emails are classed as non-spam due to the majority of the energy being collected in the larger non-spam document nodes in the graph. Note: in the extreme case, a node may have only one arc, in which case its weight will be 1 and E' * W' will be equal to E. This means the 'out' energy will be equal to the 'in' energy. If a node has only one arc, either it is the document node energised at the beginning, or the arc is attached to the node that sent the energy to it. If the same amount of energy is sent back, there is the possibility of an infinite loop of energy transfer between the two nodes. One possible way of preventing this is to introduce a constant of decay by, for example, normalising the weights to slightly less than 1, thus ensuring that the energy passed on is always less than that received. However, this may still cause exaggerated results, as loops in nodes with few arcs (documents with few words) will lose energy less rapidly than loops in nodes with many arcs. Instead, a constraint is introduced that prevents feedback, i.e. energy is not sent back from the receiver to the immediate sender node. Note that this does not prevent loops, but any loop must contain at least four nodes (two term nodes and two document nodes), and therefore each node in it will have at least two arcs. These arcs are consequently guaranteed to have weights less than 1, and will therefore cause energy decay at a reasonable rate.
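A minimal sketch of the spreading-activation scheme just described, including the division by degree, the normalisation of arc weights and the no-feedback constraint. The class names, threshold value and class labels are illustrative assumptions, not the project's Node/DocNode/TermNode classes.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of CNG energy propagation and class-average classification.
    public class CngSketch {
        static final double THRESHOLD = 0.0001; // assumed cut-off; tuned by testing in the project

        static class Node {
            final List<Node> neighbours = new ArrayList<>();
            final List<Double> weights = new ArrayList<>(); // parallel to neighbours
            double energy = 0.0;
            int clazz = -1; // class label for document nodes (0 = non-spam, 1 = spam)

            void energise(double e, Node sender) {
                if (neighbours.isEmpty()) return;
                double e1 = e / neighbours.size(); // divide by degree to damp large documents
                energy += e1;
                if (e1 > THRESHOLD) {
                    double total = 0.0;            // normalise arc weights to sum to 1
                    for (double w : weights) total += w;
                    for (int i = 0; i < neighbours.size(); i++) {
                        Node n = neighbours.get(i);
                        if (n != sender) {         // no feedback to the immediate sender
                            n.energise(e1 * weights.get(i) / total, this);
                        }
                    }
                }
            }
        }

        // Class(d) = argmax over classes of the average energy of that class's documents.
        // Assumes both classes are represented in allDocs.
        static int classify(Node queryDoc, List<Node> allDocs) {
            queryDoc.energise(1.0, null);
            double[] sum = new double[2];
            int[] count = new int[2];
            for (Node d : allDocs) {
                sum[d.clazz] += d.energy;
                count[d.clazz]++;
            }
            return sum[0] / count[0] >= sum[1] / count[1] ? 0 : 1;
        }
    }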

[Figure: a contextual network graph of document and term nodes. Energy can travel in both directions in a cycle; energy will not return from a leaf node.]

This algorithm distributes energy relative to the lengths of the paths and the weights of the arcs between two words, as expected. After all the energy has been distributed, the document is classified by finding the maximum of the energy averages for each class:

    Class(d) = \arg\max_{c \in Classes} \; avg_{d_i \in c}(energy(d_i))

Learning a document is much simpler with CNG than with LSI. A new document node is simply added to the graph by creating arcs between it and all nodes that represent terms that the document contains. The weights of the arcs are based on the number of times each term appears in the new document. Note that in CNG, the addition of a new document to the structure does not involve altering the existing graph, other than the creation of new arcs between the new document node and the term nodes immediately linked to it. Unaffected segments of the graph do not need to be accessed. This

is contrasted with the addition of a document in LSI, where the entire term-document matrix must be re-decomposed. This advantage, and the fact that the data structure is a graph, means that the data can be distributed among a number of systems, and the graph can be operated on by several processes at a time.

2.4 Comparison of Algorithms

                             Naïve-Bayes              LSI                               CNG
    Document Classification  Based on statistical     Derived from generalisations     Defined by energy
                             probabilities            based on inherent latent         activation across a graph
                                                      contextual data
    Document Learning        Updating statistics      Recalculation of                 Addition of nodes
                                                      generalisation matrices          to a graph
    Semantic Structure       No                       Yes                              Yes
    Patented                 No                       Yes                              No
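All three classifiers share the same external contract, which the application captures in an abstract Filter class (section 3.2.3): learn a document, classify a document, and save or load the data learned so far. A minimal sketch of such a contract; the method signatures are assumptions for illustration, not the project's actual API.

    import java.io.IOException;
    import java.util.List;

    // Sketch of the shared classifier contract described in section 3.2.3.
    public abstract class Filter {
        // Add a pre-classified document to the knowledge base.
        public abstract void learn(List<String> words, int clazz);

        // Return the predicted class (0 = non-spam, 1 = spam) of a document.
        public abstract int classify(List<String> words);

        // Persist and restore the data learned so far.
        public abstract void save(String path) throws IOException;
        public abstract void load(String path) throws IOException;
    }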

3. Methods

3.1 Source Material

The three algorithms were tested using the SpamAssassin public corpus of spam and non-spam emails. This corpus was chosen because of its relative "difficulty": many of the non-spam documents contained spam-like elements, especially those taken from mailing lists. This helped testing by reducing the likelihood of unhelpful 100% results appearing on smaller tests. The emails contained in this corpus include full message headers, from which the subject header is taken and added to the message body to form the input documents [7]. The documents are then tokenised crudely by using spaces and punctuation marks as word delimiters [8]. The document classes are encoded by use of the file system: spam emails and non-spam emails are stored in separate system directories. The location of each document is then referred to in order to determine its correct class and calculate the accuracy of the classifier.

3.2 Application Structure

3.2.1 Application Description

When the application begins, the corpus directories are scanned and a list is made of the documents to be classified. Documents are then chosen at random from this list. The lifecycle of a document in the program can be split into two stages: document preparation and text classification. During the document preparation stage, the document is parsed, processed and converted to a list of terms for use by the classifier. The classifier then operates on the document according to the algorithm being used and returns a classification. The program compares this classification to the actual class of the document (determined by its directory), and the result of this comparison is reported in a log file. The next document is then selected at random from the list.

[7] HTML tags contained in the documents were not removed during the document preparation stage. However, these may be more suitably dealt with by a rule-based classifier, and removed from a machine-learning classifier like those tested.
[8] An alternative approach would be simply to remove punctuation instead of using it as a delimiter. This is not implemented in the project as the corpus used does not warrant it; however, certain spam-writers now employ a strategy of splitting known common spam words with punctuation marks in order to confuse filters. This is easy to detect with a rule-based classifier, however, and may therefore not continue to be the preferred tactic for spammers in the future. In general, it seems that the ideal delimiting technique for spam documents varies over time, and is not covered in this project.
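A minimal sketch of the crude tokenisation described above, treating runs of whitespace and punctuation as delimiters. The regular expression and the lower-casing are illustrative assumptions, not the project's exact rules.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of crude tokenisation: spaces and punctuation act as word delimiters.
    public class TokeniserSketch {
        public static List<String> tokenise(String subjectAndBody) {
            String[] raw = subjectAndBody.toLowerCase().split("[\\s\\p{Punct}]+");
            List<String> words = new ArrayList<>();
            for (String w : raw) {
                if (!w.isEmpty()) words.add(w); // split can yield a leading empty token
            }
            return words;
        }
    }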

This continues until all documents have been processed. Two operations are performed on the document during the document preparation stage. The first, stop-word removal, removes from the document any words that occur on a list of stop-words, the most common words in a language. The second operation uses a word stemmer to strip words of affixes. Both of these operations are described below.

3.2.2 Application Diagrams

Functional diagram: document chosen -> tokenisation -> stop-word removal -> stemming -> classifiers -> results.
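The pipeline in the functional diagram might be sketched as follows, reusing the tokenise sketch above. The Stemmer interface is a hypothetical wrapper around Porter's algorithm, and the sorted stop-list mirrors the binary-search approach of the Vocabulary class (section 3.2.3); none of these names is the project's actual code.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch of the document-preparation pipeline: tokenise, remove stop-words, stem.
    public class PrepareSketch {
        interface Stemmer { String stem(String word); } // assumed wrapper around Porter's algorithm

        private final List<String> stopList; // sorted list of 571 stop-words (see section 3.3)
        private final Stemmer stemmer;

        public PrepareSketch(List<String> sortedStopList, Stemmer stemmer) {
            this.stopList = sortedStopList;
            this.stemmer = stemmer;
        }

        public List<String> prepare(String subjectAndBody) {
            List<String> out = new ArrayList<>();
            for (String w : TokeniserSketch.tokenise(subjectAndBody)) {
                // Stop-word removal: keep only words absent from the sorted stop-list.
                if (Collections.binarySearch(stopList, w) < 0) {
                    out.add(stemmer.stem(w)); // replace each surviving word by its stem
                }
            }
            return out;
        }
    }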

[UML class structure diagram]

3.2.3 Class Descriptions

Frequencies: A class that contains a list of words and their corresponding frequency values. The word list is implemented using the Vocabulary class described below, which stores the words alphabetically in order to facilitate faster searching. The frequency values are stored as a vector of integers. The location of a word in the vocabulary corresponds to the location of its frequency value in this vector. When a word is added to the vocabulary, its location is returned, and this location is used to determine where the frequency value should be added to the frequency list. Once a word is added to the Frequencies object, it cannot be removed. In this way, the correspondence between the word list and the frequency list is maintained.

FreqsInDoc: A subclass of the Frequencies class that associates a set of word-frequency pairs with a specific document.

Filter: An abstract class that defines the fundamental characteristics shared among all classifiers: the ability to learn and classify documents, and the ability to save and load the data learned so far.

NaiveBayes: The filter class that implements the Naïve Bayes classifier.

Lsi: The filter class that implements the Latent Semantic Indexing classifier.

LsiMatrix: A class that defines a matrix generated from a set of documents and a vocabulary. [9]

CNG: The filter class that implements the Contextual Network Graphs classifier.

Prepare: The class that implements the document pre-processing stage.

Stemmer: Porter's stemming algorithm class (not written by the author).

Node: An abstract class defining a CNG node, including a vector of edges and energising functions.

DocNode: A subclass of the Node class that is specific to document nodes. It contains a constructor that creates a DocNode from a document and adds it to a pre-existing graph by adding edges between it and the relevant term nodes in the

[9] This class extends the Matrix class in the Java Matrix (JAMA) package created by the US National Institute of Standards and Technology. This package is used in order to take advantage of its Singular Value Decomposition function, used in the LSI algorithm.

graph.

TermNode: A subclass of the Node class that is specific to term (word) nodes.

Document: A class that represents a document as a list of words and a classification (derived from the file system, as described in section 3.1).

SciNot: A class representing decimals in scientific notation, used to avoid the overhead of the slower BigDecimal class included in the Java Math library.

KFoldFilter: The main class, which defines the location of the document files, the test mechanism, the filter types to be used for a specific test, and the logging mechanisms.

Vocabulary: A subclass of the Java Vector utility class, adapted to contain a list of unique words. The words are stored in alphabetical order in the vector. This class uses the binary search algorithm to find and add words.

3.3 Stop-word Removal

The stop-list used in this project contains 571 words and is based on Salton's SMART information retrieval system [10]. Stop-words are the set of words in a language judged to carry the least semantic content, and are therefore unlikely to be effective in classifying the documents in which they appear. As a result of their lack of semantic content, they are frequently also the most common words in a language, as they include pronouns, conjunctions, etc. For these reasons they are removed from the documents before learning or classification in order to improve computational performance. Once a document is tokenised, it is represented inside the application as a list of words. The removal of stop-words therefore simply involves checking each word in a document against the stop-list, and removing it if it matches.

3.4 Word Stemming

The other pre-processing technique is word stemming. This reduces the number of distinct terms that need to be processed by the classifier by grouping together words with identical morphological

[10] Salton, 1971

stems. Affixes are removed from words and verbs are de-conjugated as far as possible. The justification for this technique is that words with the same stem will usually have similar semantic content. For example, the words "document" and "documents" do not differ semantically by enough to warrant their treatment as two separate words by the classifier. The stemming algorithm removes the "s" suffix from the latter, giving just one word, "document". Word stemming has two advantages: it reduces computation costs by decreasing the size of the vocabulary that needs to be processed, and it removes the handicap attached to stems (or semantic entities) that produce a large number of variant words (such as verbs), namely that the frequency of the semantic entity would otherwise be distributed among a number of different words, diluting its importance in the classifier, similar to the synonymy problem described above. Word stemmers come in two main varieties: those that contain stem dictionaries, and those that contain affix dictionaries (with corresponding rules governing their use). The algorithm used in this project is Porter's stemming algorithm [11], which employs the latter technique, which, although more likely to make mistakes, is more adaptable to unseen words that do not match a stem contained in a dictionary. Note that for text classification, stemming mistakes do not make much difference as long as the mistakes are consistent. Every word that gets processed by the classifier will have been stemmed, so two identical words will be stemmed, or mis-stemmed, in the same way. For example, the word "late" will be stemmed by Porter's algorithm to "lat", as will every other instance of the word "late", so "lat" will be the classifier's representation of the word "late". Essentially, the stemming provides a homomorphism from the vocabulary onto another, smaller one. Two problems may arise from this stemming technique: two unrelated words may be stemmed to the same token, thus combining their semantic values and distorting the results; and words that act as the stems of other words, and therefore should not be stemmed, may themselves be stemmed, resulting in a mismatch with their variants (e.g. the word "apply" may be stemmed to "app", but the word "applies" would be stemmed to "appl"). The degree to which these problems are important depends on the quality of the stemming

[11] Porter, 1980

algorithm, and Porter's stemmer, used in this project, offers a much greater advantage than disadvantage. In the project, Porter's stemmer is applied after stop-word removal (the stop-list is unstemmed). Each word is sent to the stemmer individually, and is replaced in the document (recall that documents are now stored simply as lists of words) by the output of the stemmer.

3.5 Text Classification

This section describes the implementation of each of the text classifiers used in the project.

3.5.1 Naïve Bayes Classifier

The Naïve Bayes classifier is the simplest of all the classifiers, and is represented as one class. Recall that the Naïve Bayes classifier uses the following formula to classify documents:

    f(x) = \arg\max_{v_j \in V} \left( \prod_{i=1}^{n} P(a_i \mid v_j) \right) P(v_j)

where

    P(a_i \mid v_j) = \frac{n_c + 1}{n + |Vocabulary|}

    n is the total number of words (not distinct) in the document subset with class v_j, and
    n_c is the number of occurrences of the attribute a_i in these documents.

The data required by the classifier are therefore:

    the vocabulary: words previously processed by the classifier;
    the frequency of each of these words for each class;
    the number of documents in each class.

The vocabulary is stored as an instance of the Vocabulary class, described above (vocabulary). The word frequencies are stored as an array of Frequencies objects (frequencies), with each object in the array referring to the frequencies of the words in the vocabulary for a specific class (in this

project, Spam and Non-spam). The words are stored as String objects, and these objects are referenced both by the Vocabulary object and the Frequencies objects, in order to prevent the redundancy that would be incurred if the same word were stored a number of times. The total number of words in all documents of a certain class can be determined from the Frequencies objects by summing all the frequency values in the object (totnumwords). The number of documents in each class is stored as an array of integers (numdocs). The classifier then need only calculate

    \left( \prod_{i=1}^{n} P(a_i \mid v_j) \right) P(v_j)

for every class v_j, using the objects described above, which becomes:

    \left( \prod_{i=1}^{n} \frac{frequencies[j].word(i) + 1}{totnumwords[j] + vocabulary.size} \right) \cdot numdocs[j]

To learn a document, each word is compared with the Vocabulary object. If it does not exist, it is added. The word is then compared with the contents of the member of the frequencies array that corresponds to the class of the document. Any words that do not exist in the Frequencies object are added with corresponding frequency values of 1. If a word is found that already exists in the Frequencies object, its frequency value is incremented. Each word is represented as a String object, as mentioned above, so it is the object reference pointing to the word that is added to the Vocabulary and Frequencies objects. However, when a word w is compared with the contents of the vocabulary and a match w_1 is found, the String objects of w and w_1 will of course be different, as they were generated during the tokenisation of different documents. Furthermore, the word w_1 will already be present in at least one of the Frequencies objects in the frequencies array, as its existence in the vocabulary implies that it was contained in at least one document processed some time in the past. Therefore, to avoid all possibility of duplication of words, it is w_1 that is used for the second step of checking the word against the

frequencies object.

3.5.2 Latent Semantic Indexing

The data structures required for the implementation of the Latent Semantic Indexing algorithm are similar to those required for the Naïve Bayes classifier. Again, a Vocabulary object is used to store a list of terms, and the frequencies of these terms are also stored, but this time the frequencies relate to each document rather than to each document class, so FreqsInDoc objects are used instead of Frequencies objects (see the descriptions above), and each document has a class attribute. One FreqsInDoc object is created for each document. A combination of the contents of the FreqsInDoc objects and the Vocabulary generates the columns of the LSI matrix. Recall that the matrix used in the LSI algorithm is a term-document matrix, so the number of terms determines the number of rows and the number of documents determines the number of columns. The LsiMatrix class is used to generate the term-document matrix. This class is a subclass of the JAMA Matrix class. The value at cell (i, j) of the matrix will be 0 if the word at position i in the vocabulary does not exist in FreqsInDoc number j, or else the frequency of word i in FreqsInDoc number j. Note that after a reasonable number of documents have been processed, most of the cells in any particular column will have values of 0, as there will be more than twice as many words in the vocabulary as in the document corresponding to that column. Therefore, the matrix itself will have a majority of cells with value 0, and it is wasteful to store this matrix in memory. Instead, the FreqsInDoc array and the vocabulary are stored and added to when new documents are processed, and the matrix can be derived at any stage from these structures. Before document classification can take place, the term-document matrix must be processed by Singular Value Decomposition (SVD). It is this process that takes the greatest amount of computation, so it can be delayed until the classification stage if documents are being learned in bulk. The SVD algorithm is contained in the JAMA Matrix class and returns an SVD object that

contains three matrices corresponding to the T, S and D described in section 2.2. After the SVD is completed, new versions of the T, S and D matrices (T_1, S_1 and D_1 in section 2.2) are created by removing a certain amount of data from each (this corresponds to the dimensionality-reduction stage). The amount of data to be retained is determined by a parameter passed to the classifier at object creation, and can be either an absolute number (such as 50, which corresponds to retaining at most fifty factors, or fewer if fifty documents have not yet been processed [12]) or a proportion (such as 0.2, which corresponds to retaining at most one fifth of the factors, rounded down if the number of documents is not divisible by five). The optimum value for this parameter is to be determined by testing. These matrices are then used to calculate a pseudo-document for the document to be classified. The pseudo-document is then compared with each row of D, and the nearest neighbours are found as described in section 2.2. Each row in D corresponds to a FreqsInDoc object, so the classes of the nearest neighbours can be determined by checking the class attribute of the document of the FreqsInDoc object corresponding to each one. The number of nearest neighbours to use is defined by a parameter to the classifier, and the optimum number will be determined by testing. Learning in the LSI algorithm involves simply adding columns and rows to the original term-document matrix (X in section 2.2). The rows and columns of this matrix are derived from the vocabulary object and the FreqsInDoc array that relates to each document. Therefore, to add a new document, a new FreqsInDoc object is generated from the document, and any previously unseen words are added to the vocabulary. A FreqsInDoc object is created by generating a list of unique terms from the document and combining them with their corresponding frequency values for the document. These two lists, along with the document object itself (an instance of the Document class), form the FreqsInDoc object, which is then added to the array. As stated above, the term-document matrix must be decomposed using SVD before classification is carried out. This is an intermediary step which belongs to the learning or classification steps, as it

[12] The maximum number of factors is in fact the rank of the S matrix, which is the minimum of t and d, the number of terms and documents processed so far. However, as the number of terms will be larger than the number of documents (up to a reasonable number of documents), this is equivalent to the number of documents processed.
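As a rough illustration of deriving the matrix on demand from the sparse structures, rather than keeping the mostly-zero matrix in memory: the sketch below uses plain word-to-frequency maps in place of the project's FreqsInDoc objects, and all names are illustrative assumptions.

    import Jama.Matrix;
    import java.util.List;
    import java.util.Map;

    // Sketch of building the dense t x d term-document matrix on demand.
    public class MatrixBuilderSketch {
        // vocabulary: sorted list of all distinct terms;
        // docs: one word -> frequency map per document (stand-in for FreqsInDoc).
        public static Matrix build(List<String> vocabulary, List<Map<String, Integer>> docs) {
            double[][] cells = new double[vocabulary.size()][docs.size()]; // zero-filled by default
            for (int j = 0; j < docs.size(); j++) {
                Map<String, Integer> doc = docs.get(j);
                for (int i = 0; i < vocabulary.size(); i++) {
                    Integer f = doc.get(vocabulary.get(i));
                    if (f != null) cells[i][j] = f; // cell (i, j) = frequency of term i in document j
                }
            }
            return new Matrix(cells);
        }
    }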


More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Latent Semantic Indexing

Latent Semantic Indexing Latent Semantic Indexing Thanks to Ian Soboroff Information Retrieval 1 Issues: Vector Space Model Assumes terms are independent Some terms are likely to appear together synonyms, related words spelling

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

DEGENERACY AND THE FUNDAMENTAL THEOREM

DEGENERACY AND THE FUNDAMENTAL THEOREM DEGENERACY AND THE FUNDAMENTAL THEOREM The Standard Simplex Method in Matrix Notation: we start with the standard form of the linear program in matrix notation: (SLP) m n we assume (SLP) is feasible, and

More information

The Semantic Conference Organizer

The Semantic Conference Organizer 34 The Semantic Conference Organizer Kevin Heinrich, Michael W. Berry, Jack J. Dongarra, Sathish Vadhiyar University of Tennessee, Knoxville, USA CONTENTS 34.1 Background... 571 34.2 Latent Semantic Indexing...

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems

More information

Centrality Book. cohesion.

Centrality Book. cohesion. Cohesion The graph-theoretic terms discussed in the previous chapter have very specific and concrete meanings which are highly shared across the field of graph theory and other fields like social network

More information

Interactive Math Glossary Terms and Definitions

Interactive Math Glossary Terms and Definitions Terms and Definitions Absolute Value the magnitude of a number, or the distance from 0 on a real number line Addend any number or quantity being added addend + addend = sum Additive Property of Area the

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution Summary Each of the ham/spam classifiers has been tested against random samples from pre- processed enron sets 1 through 6 obtained via: http://www.aueb.gr/users/ion/data/enron- spam/, or the entire set

More information

Retrieval by Content. Part 3: Text Retrieval Latent Semantic Indexing. Srihari: CSE 626 1

Retrieval by Content. Part 3: Text Retrieval Latent Semantic Indexing. Srihari: CSE 626 1 Retrieval by Content art 3: Text Retrieval Latent Semantic Indexing Srihari: CSE 626 1 Latent Semantic Indexing LSI isadvantage of exclusive use of representing a document as a T-dimensional vector of

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Exact Algorithms Lecture 7: FPT Hardness and the ETH

Exact Algorithms Lecture 7: FPT Hardness and the ETH Exact Algorithms Lecture 7: FPT Hardness and the ETH February 12, 2016 Lecturer: Michael Lampis 1 Reminder: FPT algorithms Definition 1. A parameterized problem is a function from (χ, k) {0, 1} N to {0,

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

Query Refinement and Search Result Presentation

Query Refinement and Search Result Presentation Query Refinement and Search Result Presentation (Short) Queries & Information Needs A query can be a poor representation of the information need Short queries are often used in search engines due to the

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

Uncertain Data Models

Uncertain Data Models Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data

More information

Towards Understanding Latent Semantic Indexing. Second Reader: Dr. Mario Nascimento

Towards Understanding Latent Semantic Indexing. Second Reader: Dr. Mario Nascimento Towards Understanding Latent Semantic Indexing Bin Cheng Supervisor: Dr. Eleni Stroulia Second Reader: Dr. Mario Nascimento 0 TABLE OF CONTENTS ABSTRACT...3 1 INTRODUCTION...4 2 RELATED WORKS...6 2.1 TRADITIONAL

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

CORE for Anti-Spam. - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow

CORE for Anti-Spam. - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow CORE for Anti-Spam - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow Contents 1 Spam Defense An Overview... 3 1.1 Efficient Spam Protection Procedure...

More information

Digital Libraries: Language Technologies

Digital Libraries: Language Technologies Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Probabilistic Learning Classification using Naïve Bayes

Probabilistic Learning Classification using Naïve Bayes Probabilistic Learning Classification using Naïve Bayes Weather forecasts are usually provided in terms such as 70 percent chance of rain. These forecasts are known as probabilities of precipitation reports.

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Readings in unsupervised Learning Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Hashing 2 Hashing: definition Hashing is the process of converting

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Chapter 5 VARIABLE-LENGTH CODING Information Theory Results (II)

Chapter 5 VARIABLE-LENGTH CODING Information Theory Results (II) Chapter 5 VARIABLE-LENGTH CODING ---- Information Theory Results (II) 1 Some Fundamental Results Coding an Information Source Consider an information source, represented by a source alphabet S. S = { s,

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016 CPSC 340: Machine Learning and Data Mining Non-Parametric Models Fall 2016 Admin Course add/drop deadline tomorrow. Assignment 1 is due Friday. Setup your CS undergrad account ASAP to use Handin: https://www.cs.ubc.ca/getacct

More information

Impact of Term Weighting Schemes on Document Clustering A Review

Impact of Term Weighting Schemes on Document Clustering A Review Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Supervised Learning Classification Algorithms Comparison

Supervised Learning Classification Algorithms Comparison Supervised Learning Classification Algorithms Comparison Aditya Singh Rathore B.Tech, J.K. Lakshmipat University -------------------------------------------------------------***---------------------------------------------------------

More information

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN , An Integrated Neural IR System. Victoria J. Hodge Dept. of Computer Science, University ofyork, UK vicky@cs.york.ac.uk Jim Austin Dept. of Computer Science, University ofyork, UK austin@cs.york.ac.uk Abstract.

More information

ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg

ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg CfEM You should be able to use this evaluation booklet to help chart your progress in the Maths department from August in S1 until

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

GoldSim: Using Simulation to Move Beyond the Limitations of Spreadsheet Models

GoldSim: Using Simulation to Move Beyond the Limitations of Spreadsheet Models GoldSim: Using Simulation to Move Beyond the Limitations of Spreadsheet Models White Paper Abstract While spreadsheets are appropriate for many types of applications, due to a number of inherent limitations

More information

Project Report: "Bayesian Spam Filter"

Project Report: Bayesian  Spam Filter Humboldt-Universität zu Berlin Lehrstuhl für Maschinelles Lernen Sommersemester 2016 Maschinelles Lernen 1 Project Report: "Bayesian E-Mail Spam Filter" The Bayesians Sabine Bertram, Carolina Gumuljo,

More information

Mathematics Scope & Sequence Grade 4

Mathematics Scope & Sequence Grade 4 Mathematics Scope & Sequence Grade 4 Revised: May 24, 2016 First Nine Weeks (39 days) Whole Numbers Place Value 4.2B represent the value of the digit in whole numbers through 1,000,000,000 and decimals

More information

06: Logistic Regression

06: Logistic Regression 06_Logistic_Regression 06: Logistic Regression Previous Next Index Classification Where y is a discrete value Develop the logistic regression algorithm to determine what class a new input should fall into

More information

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Weighted Powers Ranking Method

Weighted Powers Ranking Method Weighted Powers Ranking Method Introduction The Weighted Powers Ranking Method is a method for ranking sports teams utilizing both number of teams, and strength of the schedule (i.e. how good are the teams

More information

MRT based Fixed Block size Transform Coding

MRT based Fixed Block size Transform Coding 3 MRT based Fixed Block size Transform Coding Contents 3.1 Transform Coding..64 3.1.1 Transform Selection...65 3.1.2 Sub-image size selection... 66 3.1.3 Bit Allocation.....67 3.2 Transform coding using

More information

Predict the box office of US movies

Predict the box office of US movies Predict the box office of US movies Group members: Hanqing Ma, Jin Sun, Zeyu Zhang 1. Introduction Our task is to predict the box office of the upcoming movies using the properties of the movies, such

More information

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, 00142 Roma, Italy e-mail: pimassol@istat.it 1. Introduction Questions can be usually asked following specific

More information

Geographic Information Fundamentals Overview

Geographic Information Fundamentals Overview CEN TC 287 Date: 1998-07 CR 287002:1998 CEN TC 287 Secretariat: AFNOR Geographic Information Fundamentals Overview Geoinformation Übersicht Information géographique Vue d'ensemble ICS: Descriptors: Document

More information

Predicting Popular Xbox games based on Search Queries of Users

Predicting Popular Xbox games based on Search Queries of Users 1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information