Schematizing a Global SPAM Indicative Probability

NIKOLAOS KORFIATIS, MARIOS POULOS, SOZON PAPAVLASSOPOULOS
Department of Management Science and Technology, Athens University of Economics and Business, Athens, Greece
Department of Archive and Library Sciences, Ionian University, Corfu, Greece

Abstract

In this paper we propose a middleware infrastructure to address the problem of filtering unsolicited mail messages (known as SPAM). Our approach uses Bayesian classification of SPAM messages, built upon categorization models that map a probability to a word using text analysis not only of unsolicited mail but also of legitimate mail messages, which makes it easier to draw a cumulative inference about the nature of an e-mail message. Our proposed architecture extends these models with advances in collaborative filtering methods, expressed via peer-to-peer networks, which will help to build more effective and accurate anti-spam filters.

Key-Words: e-mail, SPAM, Privacy, peer-to-peer, Bayesian Classifiers

1 Introduction

SPAM [1], also known as mass commercial unsolicited e-mail, is a fast-growing phenomenon that affects internet users at all levels. For recipients ranging from end users to large enterprises such as Internet Service Providers (ISPs), SPAM is the most common type of e-mail that a typical internet user receives every day. The socio-technical aspects of SPAM range from bandwidth costs to security and privacy matters. Furthermore, the development of sophisticated software crawlers, which make it easier for spammers to harvest the e-mail addresses of people who have made them public via a website or by participating in an internet community such as the USENET news, poses a threat to the use of e-mail as the primary means of computer-mediated communication.
SPAM protection currently follows two approaches. The first is the legal approach, now being applied in the US and the EU, which punishes senders responsible for large numbers of unwanted e-mails sent to internet users in violation of their privacy rights. The other side of the coin is the cost-sensitive application of techniques already developed in fields such as information retrieval and text categorization. Following the latter, we take a collaborative filtering approach that uses node interconnections for information exchange, which is the core architecture of a peer-to-peer network. Collaborative filtering refers to the exchange of preferences and annotations regarding the same corpora of documents and information. Following the axiom that SPAM is not sent only to certain types of users, which makes it a global phenomenon, we address the need for a collaborative filtering infrastructure that makes accurate recommendations about the intention of an e-mail message.

In the next paragraphs we categorize current SPAM filtering techniques by the part of the mail message that they target (Table 1). Next we analyse our modelling approach, which uses a certain type of peer-to-peer network architecture to realize the system, so that internet users can benefit both from a reduction in the amount of SPAM they receive every day and from a lower cost of misclassifying legitimate messages as SPAM.

Table 1: Classification Methods and targeted part of the Mail Message

Method                             Message Part
Structured Text Filters            Body
Verification Filters               Header
Distributed Adaptive Blacklists    Header
Rule Based Rankings                Body
Bayesian Classifiers               Body/Header

2 Current Approaches on Spam Filtering

Technically speaking, SPAM filtering is a cost-sensitive application of text categorization. We characterize it as cost-sensitive because the cost, in terms of information loss, of misclassifying a legitimate message as SPAM cannot be predicted. With this in mind, many classification methods have emerged, influenced by fields such as natural language processing and information retrieval. SPAM filtering methods can be classified by the part of the e-mail message that they target. A typical e-mail message consists of two parts:

- The mail header, which contains information about the origin of the message, such as the route it followed to reach the user's mail server
- The content of the message, which is going to be read by the user

Following the above taxonomy, several families of SPAM filters have emerged, depending on the part of the mail message they target.

2.1 Header Based Filtering

Header based filtering is the basic method for large-scale implementations of SPAM protection. When an e-mail message is received by the mail transfer agent (MTA), the header is separated from the body. The header contains fields such as the sender and the recipient of the message. It is then analyzed with a lexical parser in order to identify the values of its fields.
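The header/body separation described above can be illustrated with a minimal sketch. It uses Python's standard `email` package as a stand-in for the lexical parser; the paper does not name a parser, so this choice, the sample message, the addresses, and the function name `split_header_fields` are all assumptions made for illustration.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample message; the addresses are placeholders.
RAW_MESSAGE = """\
From: Alice Example <alice@example.org>
To: bob@example.net
Subject: Limited offer
Received: from mail.example.org by mx.example.net; Mon, 1 Mar 2004 10:00:00 +0000

Buy now and save!
"""


def split_header_fields(raw):
    """Separate the header from the body and return the parsed field values."""
    msg = message_from_string(raw)
    # Map lower-cased field names to their values. Repeated fields
    # (e.g. several Received lines) keep only the last value in this sketch.
    fields = {name.lower(): value for name, value in msg.items()}
    # parseaddr returns (display name, address); keep the bare address.
    sender = parseaddr(msg.get("From", ""))[1].lower()
    return sender, fields, msg.get_payload()


sender, fields, body = split_header_fields(RAW_MESSAGE)
print(sender)
print(sorted(fields))
```

The extracted sender address and field values are what a header based filter would then validate.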
Having the header values, we can apply two different types of filtering:

- Verification filtering (also known as whitelist filtering)
- Trusted filtering (also known as blacklist filtering)

Verification and Trusted filtering are the two poles of the same method, in which the header values are validated against a vector constructed by the user. Consider a domain $S$ of mail messages $S_i$ and a set of predefined categories $C = \{SPAM, LEGITIMATE\}$. We take a vector $S = \langle S_1, S_2, S_3, \ldots \rangle$ that represents the messages and a vector $P = \langle P_1, P_2, P_3, \ldots \rangle$ of classifiers for $C$. The task of categorizing an e-mail message as SPAM may be formalized as approximating the target function

\[ \Phi : S \times C \rightarrow \{T, F\}. \]

When $\Phi(S_i, C_i) = T$, $S_i$ is a positive example of $C_i$: the classification of the mail message according to $P$ agrees with the user's preferred classification. Otherwise, the value $F$ marks a negative example, in which the user's preference differs from the automated classification. In both cases (Verification and Trusted filtering) the target function uses the classification vector $P$ to process the message. The difference between the two methods lies in how the classification vector is constructed.

Header based filtering is very efficient for the user, since the message can be filtered before it arrives in the user's mailbox by automated processes based on the classification method discussed above. On the other hand, it cannot be considered a trustworthy method of SPAM filtering, since it classifies messages based on a very small portion of the message, which senders can often alter using network programming techniques.

2.2 Message/Content Based Filtering

While header based filtering represents the majority of SPAM filtering policies, especially in large organizations, there are also several implementations of SPAM protection in commercial and open source mail clients, such as Outlook and Mozilla [2], that target the content part of the mail message. The basic underlying procedure for classifying a mail message as SPAM is rule based filtering. Typically, a user constructs rules, i.e. sets of classifiers that are activated when a new message arrives. In its simple form, the user names a set of words and the rule engine (part of the e-mail client software) tries to match these words against the content of the mail message. As with header based filtering, the message is categorized as SPAM only if it matches the rule. In a more advanced form of content filtering, a combination of rules is applied and the message is classified based on the overall score it collects across those rules. This score is then compared against a predefined threshold, usually set by the user, and the message is classified as SPAM when the overall score exceeds the threshold.

Using the message content as the target of the filtering method provides an accurate mechanism that classifies mail messages according to the user's own classification terms, which makes the filter better targeted to the characteristics of the individual user's mail. The main disadvantage of this filtering model is that it is built upon characteristics of SPAM messages that are not global, and interaction with the user is always needed to enter new values or modify existing ones in the rule vector.

2.3 Mixed Mode Filtering

Mixed mode filtering combines the filtering methods discussed above.
Typically, mixed mode filtering accumulates classifier decisions by applying filtering methods to both the header and the content part of the e-mail message. Examination of existing SPAM mail corpora shows that a typical SPAM message contains suspicious terms in both parts of the message. Radical implementations of this filtering policy use a mixed mode of interaction with the user in order to find how far each message part influences the classification decision.

3 The Bayesian Filtering Approaches

A special type of mixed mode filter is a Bayesian filter that analyses both the header and the content parts of the message. This type of filtering uses a probability model to characterize a message based on the total probability accumulated by specific terms in the header and content parts of the e-mail message. Graham [3] suggested building Bayesian probability models of SPAM and non-SPAM words. The general pattern is that some words occur more frequently in known SPAM, and other words occur more frequently in legitimate messages. Using this approach we can generate probabilities for each attribute of the e-mail message and, following a supervised learning period, create a probability distribution of the terms that occur in both SPAM and legitimate messages.

Naive Bayes classifiers have recently proved extremely accurate for SPAM classification [4]. This family of filters works in mixed mode by analyzing both content and header values of spam corpora. A message $s$ is represented by a vector $\vec{x} = \langle x_1, x_2, \ldots, x_n \rangle$, where $x_1, x_2, \ldots, x_n$ are the values of the attributes $X_1, X_2, \ldots, X_n$. In our case attributes correspond to words, i.e. each attribute shows whether a particular word (e.g. "offer") appears in the message.
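The word-presence representation just described can be sketched in a few lines. The vocabulary, the tokenization rule, and the function names below are hypothetical choices made for illustration and are not taken from the paper.

```python
import re

# Hypothetical vocabulary: one attribute X_i per word.
VOCABULARY = ["offer", "free", "meeting", "report"]


def tokenize(text):
    """Lower-case the text and return the set of distinct words it contains."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def to_attribute_vector(text, vocabulary=VOCABULARY):
    """Return x = <x_1, ..., x_n>, where x_i is 1 if word i appears in the text."""
    words = tokenize(text)
    return [1 if term in words else 0 for term in vocabulary]


print(to_attribute_vector("Special OFFER: free gift!"))   # [1, 1, 0, 0]
print(to_attribute_vector("Quarterly report attached"))   # [0, 0, 0, 1]
```

In a mixed mode filter the same encoding could be applied to header field values as well as body text.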
From Bayes' theorem and the theorem of total probability, the probability that a mail message $s$ with vector $\vec{x} = \langle x_1, \ldots, x_n \rangle$ belongs to category $c$ is

\[
P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{\kappa} P(C = \kappa) \prod_{i=1}^{n} P(X_i = x_i \mid C = \kappa)}, \qquad \kappa \in \{\text{SPAM}, \text{Legitimate}\}.
\]

This simple formula is the basis for many spam-filtering approaches that have been developed to improve the accuracy of SPAM classification for e-mail users (e.g. SPAMBayes). The approach can be customized through the values of the document vector. The overall SPAM probability of a novel message, based on the collection of words it contains, can be computed as follows:

\[
S_d = \frac{\prod_{i} P_i(Term_i)}{\prod_{i} P_i(Term_i) + \prod_{i} \bigl(1 - P_i(Term_i)\bigr)}
\]

$S_d$ can change during supervised learning. This method has the following benefits:

1. It can generate a filter automatically from corpora of categorized messages rather than requiring human effort in rule development.
2. It can be customized to an individual user's characteristic spam and legitimate messages.
3. A probability set can be built, and well-known methods from decision theory (for example decision trees) can be applied to improve the accuracy of the filter.

As mentioned before, in our system the SPAM indicative probability comes from a supervised learning process that constructs a user profile from not only a corpus of SPAM messages but also a corpus of legitimate messages, which yields a more coherent probabilistic model of the indicative probabilities of the message terms.

We have reviewed SPAM filtering techniques that are widely implemented in a large variety of e-mail software. We now schematize a collaborative filtering mediator that uses peer-to-peer networks to permit the exchange of these filters, producing a global SPAM indicative probability that can be used in workgroups and in cases where SPAM message terms often correlate.
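As a rough illustration of Section 3, the sketch below estimates a per-term probability $P_i(Term_i)$ from a SPAM corpus and a legitimate corpus and then combines the term probabilities of a message into $S_d$. The estimator (relative document frequency), the neutral value 0.5 for unseen terms, and the clamping to [0.01, 0.99] are assumptions in the spirit of Graham [3], not details given in this paper; the function names and toy corpora are hypothetical.

```python
from math import prod


def term_spam_probability(term, spam_docs, legit_docs):
    """Estimate P_i(Term_i) from two corpora, each a list of word sets."""
    spam_rate = sum(term in doc for doc in spam_docs) / len(spam_docs)
    legit_rate = sum(term in doc for doc in legit_docs) / len(legit_docs)
    if spam_rate + legit_rate == 0:
        return 0.5  # assumption: a term seen in neither corpus is neutral
    p = spam_rate / (spam_rate + legit_rate)
    # Assumption: clamp so that one term can never force S_d to exactly 0 or 1.
    return min(max(p, 0.01), 0.99)


def spam_indicative_probability(term_probs):
    """Combine per-term probabilities: S_d = prod(P) / (prod(P) + prod(1 - P))."""
    spam = prod(term_probs)
    legit = prod(1 - p for p in term_probs)
    return spam / (spam + legit)


# Toy corpora (hypothetical): each document is the set of words it contains.
spam_docs = [{"offer", "free"}, {"offer"}]
legit_docs = [{"meeting"}, {"offer", "report"}]

probs = [term_spam_probability(t, spam_docs, legit_docs) for t in ("offer", "free")]
print(spam_indicative_probability(probs))
```

Training on both corpora is what lets a word that is common in legitimate mail pull the combined probability down instead of merely failing to raise it.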
4 Exchanging SPAM Indicative Probabilities over Peer-to-Peer Networks

Recently, peer-to-peer networks have emerged as a new kind of decentralized architecture in which nodes of equal roles and capabilities exchange information and services directly with each other [5]. The best-known characteristic of a peer-to-peer network is its decentralized architecture, which brings advantages such as the absence of a single point of failure and independence of escalation. Most types of peer-to-peer networks are built upon an anonymous policy that permits anyone to join the network and exchange certain types of information with other peers. In our approach we use a type of peer-to-peer network that requires a light authentication process, implemented by an exchange of invitations between peers. This type of peer-to-peer network, also known as a trusted network, can also be used efficiently in security applications [6]. By joining end users of the same workgroup in a filter exchange platform, we can combine their filter characteristics and apply a probability classification scheme that becomes more effective as peers join the network and exchange their indicative probabilities.

Footnote: SPAMBayes, http://spambayes.sourceforge.net

Figure 1: Peer-to-peer deployed graph

4.1 Overall System Architecture

As can be seen from Figure 1, we define two types of nodes in our peer-to-peer network: Query Agents and Simple Peers. Query Agents handle the service requests that come from the client application. The user agent then makes a reverse lookup against the total probability distribution of the message terms that comes from the network.

Figure 2: Client-Network Architecture

We now define a parameter $\lambda$ that expresses the classification criterion as

\[
\lambda = \frac{P(C = \text{spam} \mid X_i = x_i)}{\sum_{i=1}^{n} P(C = \text{spam} \mid X_{global} = x_{global})}
\]

with $\lambda \in (0, 1]$, assuming that $\lambda$ follows the normal distribution $N(0, 1)$. For certain values of $\lambda$, as shown in Table 2, a supervised interaction with the user is required.

Table 2: λ Values

λ (%)     Threshold    Supervision
68%       Low          High
95%       Medium       Medium
99.7%     High         Low

4.2 Querying & LoopBack Service

The second component of our architecture is the LoopBack service, which is used to supervise the classification parameter. We now define the cost of classifying a legitimate message as SPAM as a vector space $C(legit \rightarrow SPAM)$. Let $W$ be a vector representation of a simple whitelist filter. We declare $R_i$ as the header value of a misclassified message $r_i$ with $r_i \in C(legit \rightarrow SPAM)$, so that $W = \langle R_1, R_2, \ldots, R_n \rangle$. The Query Agent now takes a query of the form

\[
Q(s) = P(C = \text{SPAM} \mid X = x_{global}), \qquad x \notin R_i.
\]

Adding more values to the whitelist vector incurs an extra effort to query certain nodes about the rule $R_i$ each time, so we are currently examining the concept of a special type of peer (supernode) in our architecture that stores a query history, making the querying process more efficient. Under this architecture, the classification can only be as accurate as the indicative probability held in the network for the specific term.

Similar architectures for SPAM can be found in Zhou et al. [7], which builds on the approximate object location method to identify spam by mediating the matching of e-mail header fields and constructing a fingerprint verification that can be used by the Mail Transfer Agent.
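The query step of Section 4.2 might be sketched as follows. This is a sketch under stated assumptions, not the prototype itself: the paper does not specify how peers' indicative probabilities are merged, so averaging them is an assumption, as are the representation of each peer as a term-to-probability dictionary, the neutral value 0.5, and the names `global_term_probability` and `query_agent`.

```python
from math import prod
from statistics import mean


def global_term_probability(term, peer_tables):
    """Merge the peers' estimates of P(SPAM | term).

    Assumption: the merge is a plain average over the peers that know the term.
    """
    estimates = [table[term] for table in peer_tables if term in table]
    return mean(estimates) if estimates else 0.5  # neutral if no peer knows it


def query_agent(sender, terms, whitelist, peer_tables):
    """Return Q(s), or None when the sender is whitelisted and no query is issued."""
    if sender in whitelist:
        return None  # header value is in W: skip the network query
    probs = [global_term_probability(t, peer_tables) for t in terms]
    spam = prod(probs)
    legit = prod(1 - p for p in probs)
    if spam + legit == 0:
        return 0.5  # contradictory 0/1 estimates: treat as undecided (assumption)
    return spam / (spam + legit)


# Hypothetical peers, each publishing a term -> probability table.
peers = [{"offer": 0.9, "free": 0.8}, {"offer": 0.7}]
whitelist = {"boss@example.org"}

print(query_agent("stranger@example.net", ["offer", "free"], whitelist, peers))
print(query_agent("boss@example.org", ["offer"], whitelist, peers))
```

A supernode that stores query history, as considered in the text, would correspond here to caching the results of `global_term_probability`.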
The wide use of spoofing techniques, where senders hide their address or use false addresses, makes such header-based architectures sensitive to SPAM attacks that use this method to bypass filters.

5 Ongoing and Future Work

This system is research in progress. We are currently evaluating the accuracy of the proposed system by building a functional prototype deployed on top of the JXTA API [8]. Considering the phenomenon of SPAM as an emerging problem for all users of the internet community, we would be happy to collaborate with researchers interested in extending our prototype.

References

[1] Cranor L.F. and LaMacchia B.A. Spam! Communications of the ACM, vol. 41(8), 1998, pp. 74-83.
[2] Mozilla Spam Filtering. http://www.mozilla.org/mailnews/spam.html.
[3] Graham P. A Plan for Spam. Online, August 2002. http://www.paulgraham.com/spam.html.

[4] Sahami M., Dumais S., Heckerman D., and Horvitz E. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the 1998 Workshop. AAAI Technical Report WS-98-05, Madison, Wisconsin, 1998.
[5] Androutsellis-Theotokis S. A Survey of Peer-to-Peer File Sharing Technologies. Tech. Rep. WHP-2002-03, Athens University of Economics and Business, Athens, Greece, 2003.
[6] Vlachos V., Androutsellis-Theotokis S., and Spinellis D. Security applications of peer-to-peer networks. Computer Networks, vol. 45(2), June 2004, pp. 195-205.
[7] Zhou F., Zhuang L., Zhao B.Y., Huang L., Joseph A.D., and Kubiatowicz J. Approximate Object Location and SPAM Filtering on Peer-to-Peer Systems. In Endler M. and Schmidt D. (eds.), Proceedings of ACM/IFIP/USENIX International Middleware Conference (Middleware 2003), vol. 2672 of Lecture Notes in Computer Science. Springer Verlag, Rio de Janeiro, Brazil, 2003.
[8] Gong L. Project JXTA: A Technology Overview. Tech. rep., Sun Microsystems, April 2001.