Spam detection system: a new approach based on interval type-2 fuzzy sets


Ryerson University
Digital Commons @ Ryerson
Theses and dissertations
1-1-2010

Spam detection system: a new approach based on interval type-2 fuzzy sets
Reza Ariaeinejad, Ryerson University

Part of the Electrical and Computer Engineering Commons

Recommended Citation
Ariaeinejad, Reza, "Spam detection system: a new approach based on interval type-2 fuzzy sets" (2010). Theses and dissertations. Paper 986.

This Thesis is brought to you for free and open access by Digital Commons @ Ryerson. It has been accepted for inclusion in Theses and dissertations by an authorized administrator of Digital Commons @ Ryerson. For more information, please contact bcameron@ryerson.ca.

Title Page

Spam Detection System: A New Approach Based on Interval Type-2 Fuzzy Sets

By: Reza Ariaeinejad
B.Sc. Computer Engineering, Islamic Azad University, Iran, 2004

A Thesis presented to Ryerson University in partial fulfillment of the degree of Master of Applied Science in the program of Electrical and Computer Engineering

Toronto, Ontario, Canada, 2010
Reza Ariaeinejad

Author's Declaration

I hereby declare that I am the sole author of this thesis. I authorize Ryerson University to lend this thesis to other institutions or individuals for the purpose of scholarly research.

Reza Ariaeinejad

I further authorize Ryerson University to reproduce this thesis by photocopying or by other means, at the request of other institutions or individuals for the purpose of scholarly research.

Reza Ariaeinejad

Borrower's Page

Ryerson University requires the signatures of all persons using or photocopying this thesis. Please sign below, and give contact info and date.

Spam Detection System: A New Approach Based on Interval Type-2 Fuzzy Sets
Reza Ariaeinejad
Master of Applied Science, Department of Electrical and Computer Engineering
Ryerson University, Toronto, ON, Canada, 2010

Abstract

Today, most Internet users use email to communicate electronically. They depend on the Internet to deliver their important emails safely and to the right recipients. However, the fast growth in the number of Internet users and in their use of email, together with the exponential increase in users sending unsolicited spam, has made the email system less reliable. An email can be falsely marked by a spam filter on its way to the recipient or even get buried among junk mail in the recipient's inbox. There are several intelligent anti-spam filters which use different artificial intelligence methods to detect spam, including neural networks and fuzzy logic systems. This paper presents an interval based type-2 fuzzy spam detection system. Our results show that the interval type-2 fuzzy set is an effective technique for spam detection and classification. The proposed system gives the user more control over the various categories of spam and allows for filter personalization.

Acknowledgements

First of all, I would like to thank my family, who have always been supportive and encouraging in the most critical moments throughout my life. I would also like to thank Dr. Alireza Sadeghian for his direction, supervision and invaluable advice throughout this project. Without his precious guidance, help and advice, it would have been impossible to finish this thesis successfully. I would like to thank Dr. Hooman Tahayori for sharing with me his novel idea of using interval type-2 fuzzy sets in spam detection, and for his precious help during the different stages of the project. Lastly, I would like to thank my dear friends who helped me stay balanced during the stressful times.

Table of Contents

Title Page
Author's Declaration
Borrower's Page
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Formulas

1. Chapter 1: Introduction and Problem Definition
   Spam Definition and History
   Existing Spam-Filtering Methods
   Motivation
   Objectives
   Scope of the Study
   Thesis Structure

2. Chapter 2: Background Info and Literature Review
   Internet Messaging/Mailing System
   The Process of Sending an Email
   Security Measures
   What is Spam?
   Spam Categories
   Advanced Spamming Techniques
   Anti-Spam Legislation Efforts
   Popular Spam-Filtering Methods
   Gray-listing
   Sender Policy Framework
   Domain-keys
   Real-time Black Lists
   Spam-assassin
   Learning-Based Spam-Filtering Methods
   Fuzzy Logic Based Spam-filtering Systems
   Sets: Description and Formalization
   Intervals
   Fuzzy Sets
   Description of the Fuzzy Sets Concept
   Fuzzy Sets Formalization
   Main Classes of Membership Functions
   Type-2 Fuzzy Sets and Intervals
   Centroid of a Fuzzy Set

3. Chapter 3: Methodology and Experimental Procedure
   System Design Scheme
   Building the Dictionaries
   Weight Calculation
   Parsing
   Checking with all Dictionaries
   Building Fuzzy Maps
   Distance Calculation
   Weight of Each Interval in the Map (Frequency of each Word)
   Evaluation and Prediction
   Dictionary Improvement
   Results and Analysis

4. Chapter 4: Conclusion and Future Work
   Conclusion
   Future Work

Appendix I. The Unusual Signs in the Words
Appendix II. White Dictionary Words List
Appendix III. Adult Dictionary Word List
Appendix IV. Other Dictionaries
References

List of Tables

Table 1. Contingency Table
Table 2. Main System Results
Table 3. Hotmail Contingency Table
Table 4. Hotmail Results
Table 5. Yahoo Contingency Table
Table 6. Yahoo Results
Table 7. Computer Science Data Set Results

List of Figures

Figure 1. Fuzzy Set A and its Height, Core and Support
Figure 2. Overall System Architecture
Figure 3. A Sample of the 3D Fuzzy Map
Figure 4. A Schematic of the Fuzzy Subsets
Figure 5. Hotmail Snapshot
Figure 6. Yahoo Snapshot

List of Formulas

1. Interval Width
2. Interval Center
3. Fuzzy Set
4. Fuzzy Supplement
5. Fuzzy Core
6. Triangular Fuzzy Set
7. Triangular Class Parameters
8. Trapezoidal Fuzzy Set
9. Gaussian Fuzzy Set
10. Non-Symmetric Gaussian Fuzzy Set
11. Parabolic Fuzzy Set
12. Fuzzy Type-2 Set
13. Interval Fuzzy Type-2 Set
14. FOU of a Fuzzy Set
15. UMF of a Fuzzy Set
16. LMF of a Fuzzy Set
17. J of a Fuzzy Set
18. FOU of a Fuzzy Set
19. Centroid of LMF
20. Centroid of UMF
21. C-left High
22. C-right Low
23. C-left Low
24. C-right High
25. Union Formula for Dictionary Build-up
26. Weight Formula
27. Expected Value Formula
28. Standard Deviation
29. Standard Deviation (Another Way)
30. Jaro Distance
31. Matching Formula
32. Transposition Formula
33. Jaro-Winkler Distance
34. Term Frequency
35. Inverse Document Frequency
36. TF-IDF Formula
37. Fuzzy C-mean
38. Degree of Membership
39. N-dimension Center of the Cluster
40. Spam Accuracy
41. Spam Precision
42. Spam Recall

1. Chapter 1: Introduction and Problem Definition

1.1. Spam Definition and History

The Internet is one of the most popular forms of media consumed by our society. The majority of Internet users rely on email to communicate electronically, and they depend on the Internet to safely deliver their emails to the right recipient. Millions of emails are sent and received every day. Among those, there are some unwanted emails known as Spam, an expression that originated from a Monty Python sketch [1]. Today, spam refers to junk, trash or unwanted email. The opposite of spam is referred to as Ham, which is a genuine or desirable email.

Spam is generated for many reasons, such as selling a product, acquiring personal information from users, spreading viruses and worms, advertising, political advocacy, etc. Regardless of the reasons for sending these junk emails, they create unnecessary traffic on the networks, impose unnecessary expenses on our resources and make the emailing system unreliable because of the imperfect nature of spam-filtering systems. A genuine email can be falsely caught by spam filters before it gets to the right recipient, or it may be misplaced among junk email in the user's inbox.

It is estimated that spam costs each US email user $30-50 annually in lost time and costs each employee $730 annually in reduced productivity [2, 62]. Compounding those losses, it is further estimated that US companies lose $8,900,000,000 per year as a result of the spam issue. Given these numbers, it is clear that corporations and individuals that use electronic mail and the Internet would save large amounts of time, money and resources if they could avoid spam. The same is also true for Internet Service Providers, or ISPs, and Email Service Providers, or EMPs, if the problem could be solved or at least reduced. One of the most important goals of EMPs is to identify and filter unwanted spam and to make the email server more usable.

Most Internet users have experienced some form of spam and are able to distinguish between it and ham. The first researcher to officially write a Request for Comments on the issue, in 1974, was Jon Postel [3]. The spam issue has been growing ever since.

1.2. Existing Spam-Filtering Methods

There are a number of measures to limit and prevent spam. These include user awareness, technological solutions (such as spam filtering) and legal action. From a technological standpoint, there are software-based tools which detect and block spam automatically. These tools are called spam filters. Since spammers use a variety of established techniques to penetrate users' inboxes, such as using fake email addresses, several kinds of filters are used to block these attempts, from blacklists to content-based filters. Content-based filters have been proven to be more powerful than blacklists.

There are three major categories of content-based spam filters: Learning-based, Rule-based and Keyword-based. The Keyword-based filter [30] uses a dictionary of commonly used spam words and searches for similar words in the text or document. This filter requires constant maintenance. Rule-based filters benefit from a vast range of categorized tests that find junk mail characteristics. They assign each email its own score and then decide whether the email is spam based on that score [31, 32]. Although Rule-based filters work well, they also require periodic maintenance and updates, since their rules are fixed and become outdated over time. Learning-based filters have the benefits of the two aforementioned techniques with an important advantage over both: they use machine learning techniques to update automatically and do not require periodic manual updates [56].

In addition, here is a list of currently existing non content-based methods of spam filtering, which will be explained in detail in the second chapter:

- Gray-listing
- Sender Policy Framework
- Domain-keys
- Real-time Black Lists
- Spam-assassin

Some of these non content-based methods also benefit from the use of machine learning techniques. However, the main focus of this thesis is on content-based spam filters, and specifically on the Learning-based subcategory. Many different machine learning techniques have the potential to be used in learning-based spam filters. Techniques such as Boosting Tree [31], Support Vector Machine [34], Decision Tree [35], K-nearest Neighbor [34] and Fuzzy Logic [56-61] have already been employed for this purpose with promising results. The first spam filter based on machine learning techniques was introduced by Sahami et al. [33]. They proposed a Naïve Bayesian classifier trained on previously detected spam and ham emails to categorize unseen messages. It performed well on unseen messages, which led to a rapid increase in the use of machine learning techniques in spam detection.

1.3. Motivation

Although the existing spam detection methods have been shown to be effective and reliable, there is always room for improvement. Some methods require periodic maintenance. Other methods, which have solved that weakness, are less effective at separating spam from ham. The motivation for this thesis is to improve on the results of the existing methods. Furthermore, we propose a method that functions autonomously, without the need for periodic updates. This proposed method must also prove to be more efficient and must yield promising results.

1.4. Objectives

In this thesis, we employ interval type-2 fuzzy sets to build an automatic spam filter. Type-2 fuzzy sets are a generalization of type-1 fuzzy sets, so they can cope with more uncertainty [37]. From their inception, there have been extensive discussions in the fuzzy logic and greater fuzzy sets community regarding the problem that the membership functions of type-1 fuzzy sets have no associated capability to deal with uncertainty. This seems to contradict arguments related to the need for fuzzy sets.
Professor Zadeh [38], the father of fuzzy sets, resolved the problem by proposing more sophisticated forms of fuzzy sets. The first category of these sets is called a type-2 fuzzy set. Type-2 fuzzy sets are an attempt to resolve the issue associated with type-1 fuzzy sets by incorporating the notion of uncertainty about their membership functions into fuzzy set theory. Our proposed approach is to use interval based type-2 fuzzy sets to classify an email either as ham or as spam.

Furthermore, most anti-spam filters count the number of spam words and compare that with the number of legitimate words in the message to decide whether or not it is spam [33, 39]. Smart spammers have therefore devised a new technique, which inserts random, meaningless text into the email to offset that percentage. As a result, the message is considered to be deliverable [40]. In our system, we approach this problem in a systematic way, using fuzzy maps that allow us to decide whether the message is ham or spam. The proposed system also has an adaptive nature: if spam trends change over time, with spammers employing new techniques such as using new wording or putting spaces between a word's letters, our system can easily adapt and self-update, thereby coping with the latest spam techniques.

1.5. Scope of the Study

The main objective of this study is to employ an interval type-2 fuzzy system to classify emails as either spam or ham and also to categorize spam emails into the right groups. The scope of this study is limited to emails that contain plain text only. Our design does not process links, pictures or any other material in the email apart from plain text. However, we have the ability to deal with the basic spamming techniques [see section 2.6.] such as:

- Scrambled Text, which is leaving spaces or other signs among a word's letters;
- Invisible Text, which is inserting words that neutralize statistical analysis software, always in the same color as the background of the text;
- Letter Randomization, which is inserting long, useless strings designed to neutralize signature-based filters; and
- Character Set Encoding, which usually uses base64 and printable character set encodings to hide words from the clear format of the text.
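The scrambled-text and character-set-encoding techniques listed above can be countered with a simple normalization pass before any dictionary lookup. The following is only a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the separator set and the tiny `SPAM_WORDS` watch list are all hypothetical, chosen for illustration.

```python
import base64
import re

# Hypothetical watch list for illustration only; the thesis builds its own
# dictionaries in Chapter 3.
SPAM_WORDS = {"capital", "lover", "girls"}

# A run of at least three single letters, each separated by exactly one space
# or sign (assumed separator set: space, '-', '*', '.', '_').
_SCRAMBLED = re.compile(r"\b[A-Za-z](?:[ \-*._][A-Za-z]){2,}\b")


def collapse_scrambled(text):
    """Rejoin letters that were split apart, e.g. 'c-a-p-i-t-a-l' -> 'capital'."""
    return _SCRAMBLED.sub(lambda m: re.sub(r"[^A-Za-z]", "", m.group(0)), text)


def decode_base64_body(payload):
    """Decode a base64-encoded body so its words become visible as plain text."""
    return base64.b64decode(payload).decode("utf-8", errors="replace")


def find_spam_words(text):
    """Return the watch-list words present after normalization."""
    cleaned = collapse_scrambled(text).lower()
    return {word for word in re.findall(r"[a-z]+", cleaned) if word in SPAM_WORDS}


if __name__ == "__main__":
    print(collapse_scrambled("buy c*a*p*i*t*a*l now"))
    encoded = base64.b64encode(b"capital gains").decode("ascii")
    print(find_spam_words(decode_base64_body(encoded)))
```

A pattern this simple will also join genuine sequences of single letters, so a real filter would likely need a dictionary check on the joined result before accepting it.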

Moreover, if spam trends change over time, our system has the flexibility to adapt to the new words [see section 3.8.]. That is to say, it is self-adaptive. Additionally, we have used a technique that extracts the roots of words. For instance, we eliminate the "ing" or "ed" from the end of a word and extract its root, or at least get as close as possible to the root. Using this technique will eventually help us to deal with other languages that share roots with English.

1.6. Thesis Structure

The remainder of this thesis is organized as follows. Chapter two provides the background information and literature review, giving the reader the information necessary to fully understand the problem, the existing methods, and the way our system is implemented and works. Chapter three covers the methodology and experimental procedure. There we introduce our proposed method and explain how it has been implemented. That chapter concludes with our results. Our conclusion is articulated in chapter four, where we summarize the whole work and provide our perspective on future research.

2. Chapter 2: Background Info and Literature Review

2.1. Internet Messaging/Mailing System

The Internet mailing system, Simple Mail Transfer Protocol or SMTP, has been the main protocol for sending email on the Internet for several years and is described in detail by RFC 2821 [4]. The first SMTP specification has since been superseded by newer generations of SMTP RFCs, which are backward compatible and add some new functionality.

An email consists of a body and a header. Headers have different fields and are fully described in RFC 2822 [5]. The Date header describes the time and date that the email was finished and sent out. The From and Reply-To fields describe the sender's address and the destination address of a reply. The recipient's address goes in the To field. The Carbon Copy, or CC, and Blind Carbon Copy, or BCC, fields can also specify recipients. The content of the To and CC fields can be seen by the recipient(s), whereas the content of the BCC field cannot. The Message-ID field specifies the ID of the email, and the References and In-Reply-To fields show whether the email is a reply to a previously sent email. The body of the email contains the actual message that the author of the email has provided [62].

2.2. The Process of Sending an Email

As soon as the author has finished the body and header of the email, the computer tries to connect to an SMTP server, which is hosted at the sender's ISP. This process is similar to sending a letter through the postal service. The email travels among many SMTP servers until it reaches the right server, which will put the email in the recipient's inbox. The first SMTP server looks up the recipient's address in the Domain Name System, or DNS. DNS functions like a telephone book: it translates server names to their equivalent Internet Protocol, or IP, addresses. An IP address is a numerical label assigned to any computer or device that acts as a node in a network using the Internet Protocol for communication.
The lookup process will return the IP address of the first and closest SMTP server that the email should be sent to. This can be the actual recipient or another intermediary SMTP server [62].

2.3. Security Measures

According to [19], in order to have a secure and accessible email system, we need to be able to support three essential security measures:

- Confidentiality: Protecting emails and computers from unauthorized or unknown access.
- Integrity: Guaranteeing that emails and computers are not destroyed or distorted through unauthorized access.
- Availability: Ensuring that email servers meet a reasonable service level.

2.4. What is Spam?

Millions of emails are sent and received every day. Among those, there are some unwanted emails that are called Spam. Spam is an expression that was coined in a Monty Python sketch [1]. Today, spam refers to junk, trash or unwanted email. There are many different reasons for sending spam, such as selling a product, acquiring personal information from users, spreading viruses and worms, advertising, political advocacy, etc. The opposite of spam, which is a genuine or desirable email, is referred to as Ham.

2.5. Spam Categories

According to [46], there are ten main categories of spam. They are:

1- Adult: primarily consists of content for mature audiences.
2- Financial: primarily consists of financially fraudulent content.
3- Fraud: contains any sort of fraud other than financial fraud.
4- Health: primarily contains content regarding pharmaceutical goods and products.
5- Internet: primarily contains advertisements.
6- Leisure: primarily contains marketing content devoted to selling leisure goods and services.
7- [...] spam: contains content that seeks to solicit a monetary sum from the recipient by guaranteeing further monetary gain following the initial cash advance.
8- Political: contains content pertaining to political campaigns.
9- Products: contains promotional or marketing content for products outside of the health or leisure categories.
10- Scams: contains any scam not identified in the categories above.

2.6. Advanced Spamming Techniques

Fifteen years ago spammers did not have to think about anti-spam techniques, for they did not yet exist. Over time, with the high rate of increase in spamming, organizations were forced to think more about this issue and to deploy effective spam protection techniques. These new techniques, in turn, led to the development of new sending techniques, which allowed spammers to avoid having their spam detected [19]. A few of those sending techniques are:

- Scrambled Text: Breaking up a spam word by inserting spaces or other characters among its letters. For instance, the word capital could be written as c a p i t a l or c-a-p-i-t-a-l or c*a*p*i*t*a*l.
- Invisible Text: Inserting additional, irrelevant words in order to neutralize statistical analysis software, while using the same color as the background for the text. Typically this shared color is white, which hides the irrelevant text from the user when the email is rendered by the client machine.
- Split Words: Splitting or interrupting spam words such as Lover or Girls by inserting HTML tags into them.
- Letter Randomization: Inserting long passages of irrelevant text into the body of the email in order to confuse signature-based filters.
- Character Set Encoding: Using base64 and printable character set encodings in order to hide spam words from the clear text format.

2.7. Anti-Spam Legislation Efforts

Due to the considerable financial loss generated by spam, as well as a lack of existing legal measures to prevent sending spam, new laws were passed to address this problem. The United States, Canada and the European Union have all taken legal action to respond to this issue. In 2002, the European Parliament passed the European Union Privacy and Electronic Communications Directive (EUPECD), 2002/58/EC. In 2003, the US CAN-SPAM Act, also known as the Controlling the Assault of Non-Solicited Pornography and Marketing Act, was passed. These are both legal efforts to respond to the spam issue. The EUPECD prohibits unsolicited commercial or marketing communication except when the sender has obtained the prior consent of the recipient [17]. The US CAN-SPAM Act only permits unsolicited emails that adhere to a set of restrictions. For example, deceptive subject lines are forbidden and senders must clearly mark the emails as advertisements. A legitimate return address, a physical address of the sender and an opt-out link must also be included in unsolicited emails. The US CAN-SPAM Act prohibits falsifying header information and illegally using captured third party computers to relay messages [22].

2.8. Popular Spam-Filtering Methods

There are several methods used to destroy or control spam messages [8]. One method is to block all insecure outgoing email sessions from the recipient system and have all hosts within the network use a secure email server when attempting to send emails. This method prevents abusive third parties from using the network's resources to relay emails. Another method is to detect and filter the spam at the recipient's computer.

Spam can be stopped either before it enters the receiver's computer or afterward. The former is carried out by issuing either a temporary or a permanent error code in the message delivery session. If it is a temporary code, a compliant sender system will eventually try to send the email again. If it is a permanent error code, the email will be stopped for good. If for any reason the email cannot be delivered to the right recipient, a warning message is sent to the sender. However, when the process is successful and the recipient receives the email, the sender will not be notified, even if the recipient's system uses a spam filter and categorizes the email as spam. As such, using an internal filter is more likely to increase the risk of losing information.

In addition to the content-based filters described above, the following are several methods of non content-based spam detection [62].

Gray-listing

Gray-listing has been developed with the goal of having minimal impact on users. It also requires minimal maintenance [9]. It uses three pieces of information, commonly referred to as a triplet: the IP address of the host that attempts the delivery, the email address of the sender and the email address of the recipient. If the filter does not recognize a triplet, it simply denies the delivery for a certain period of time. After that period, the triplet is no longer unrecognized; it has become familiar and the messages pass through. However, gray-listing usually only works on spammers that send spam without examining the error codes that SMTP generates.

Sender Policy Framework

Sender Policy Framework, or SPF, has been adopted by most large email service providers such as Yahoo, Google and Hotmail [10]. It prevents spammers from forging arbitrary addresses as the envelope sender in SMTP. It requires administrators to publish, at the DNS level, the IP addresses that are permitted to send email. SPF then checks all incoming emails to see whether they come from a permitted address. If an email comes from a forbidden address, it is considered spam. SPF is most effective when combined with other spam-filtering techniques.

Domain-keys

The Domain-keys technique is also used by most major email service providers [11]. This technique attempts to detect whether the header of an email has been changed after sending, and prevents spammers from forging arbitrary sender addresses. With this technique, all senders must sign outgoing emails, and the system must generate a pair of keys, one public and one private. The private key allows the sender system to sign outgoing emails, and the public key is made available through the DNS server. The recipient uses the public key to verify the signature on the email; if the verification fails, the system considers the email to be spam.
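The gray-listing triplet check described at the start of this section can be sketched in a few lines. This is an illustrative sketch only: the class name, the 300-second delay and the "defer"/"accept" return values are assumptions, not part of any particular gray-listing product. A "defer" here stands for a temporary SMTP error, which a compliant sender answers by retrying later.

```python
from typing import Dict, Tuple

# (IP of the delivering host, sender address, recipient address)
Triplet = Tuple[str, str, str]


class Greylist:
    """Hypothetical in-memory gray-list keyed by the delivery triplet."""

    def __init__(self, delay_seconds: float = 300.0) -> None:
        # Assumed waiting period; real deployments choose their own value.
        self.delay_seconds = delay_seconds
        self.first_seen: Dict[Triplet, float] = {}

    def check(self, triplet: Triplet, now: float) -> str:
        """Return 'defer' for unknown or too-recent triplets, else 'accept'."""
        first = self.first_seen.get(triplet)
        if first is None:
            # Unknown triplet: remember it and refuse temporarily.
            self.first_seen[triplet] = now
            return "defer"
        if now - first < self.delay_seconds:
            # Retry arrived before the waiting period elapsed.
            return "defer"
        # The sender retried after the delay, as a compliant server would.
        return "accept"


if __name__ == "__main__":
    greylist = Greylist()
    triplet = ("192.0.2.1", "alice@example.org", "bob@example.com")
    for moment in (0.0, 100.0, 301.0):
        print(moment, greylist.check(triplet, now=moment))
```

Passing the clock in as `now` rather than reading it inside `check` is a design choice made here to keep the sketch deterministic and easy to test.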

Real-time Black Lists

Real-time Black Lists, or RBLs, are usually distributed via DNS servers [12]. They contain the IP addresses of hosts which are suspected of being insecure. As the name suggests, the lists are updated in real time. RBLs work at the network level: if a server's IP is added to the list, all the users of that server will be blocked. An RBL also assigns a degree to each IP address in the list, so some addresses are blocked instantly and others are marked as suspicious. RBLs apply several rules to decide whether a server is a potential threat, such as whether it is an open proxy or an open mail server. The major drawback of this method is a high ratio of messages falsely identified as spam.

Spam-assassin

Spam-assassin is a rule-based spam-filtering system with an extensive built-in set of rules [12]. This rule set is a variety of mechanisms used to identify spam, from simple rules such as looking for a missing subject line to more complicated mechanisms including Bayesian filtering, white and black lists, DNS block-lists, network-based clearing address lists and collaborative filtering databases such as Pyzor, DCC, etc. After an email is received, it is checked against the rule set, and for each rule it matches, a certain score (number of points) associated with that rule is added to the running total for the email. The total is then checked against a user-defined or default threshold. If it is equal to or higher than that threshold, a warning message stating the likelihood of the email being spam is added to it. In the newer versions of Spam-assassin, there are two thresholds to be defined: one for spam and the other for possible spam. Spam-assassin is not completely flexible because the rules and scores are static for a specific version. However, its strong point is that it can take input from other methods, such as Razor.
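The score-and-threshold mechanism described above can be illustrated with a small sketch. The rule names, point values and thresholds below are invented for illustration and are not taken from Spam-assassin's actual rule set or configuration; the sketch only shows the general shape of additive rule scoring with two thresholds.

```python
from typing import Callable, List, Tuple

# (rule name, test on the message, points added when the test matches)
Rule = Tuple[str, Callable[[dict], bool], float]

# Hypothetical rules; a real rule set is far larger and more carefully tuned.
RULES: List[Rule] = [
    ("missing_subject", lambda m: not m.get("subject", "").strip(), 1.5),
    ("all_caps_subject", lambda m: m.get("subject", "").isupper(), 1.0),
    ("mentions_free_money", lambda m: "free money" in m.get("body", "").lower(), 2.5),
]


def score_message(message: dict) -> Tuple[float, List[str]]:
    """Sum the points of every matching rule and list the rules that fired."""
    hits = [(name, points) for name, test, points in RULES if test(message)]
    total = sum(points for _, points in hits)
    return total, [name for name, _ in hits]


def classify(message: dict, spam_threshold: float = 5.0,
             possible_threshold: float = 3.0) -> str:
    """Compare the total against two assumed thresholds."""
    total, _ = score_message(message)
    if total >= spam_threshold:
        return "spam"
    if total >= possible_threshold:
        return "possible spam"
    return "ham"


if __name__ == "__main__":
    sample = {"subject": "", "body": "Get FREE MONEY now"}
    print(score_message(sample), classify(sample))
```

Because the rules and points are fixed in the table, the sketch shares the limitation noted above: it cannot adapt until someone edits the rule set.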

Learning-Based Spam-Filtering Methods

A number of machine learning classification techniques have already been proposed for spam-filtering applications. Based on [25], the following characteristics of spam-filtering tasks may cause data mining issues:

- Suddenly changing class distribution: the amount of ham and spam changes significantly over time.
- Unequal and uncertain error costs: the cost of losing a legitimate message may not be equal to that of receiving a junk email.
- Disjunctive and changing target concept: spam types and trends change over time.
- Intelligent adaptive adversaries: spammers are able to develop new spamming techniques on a regular basis.

The need for a sufficient amount of training data is yet another complicating factor. In [26] a technique called co-training was proposed to overcome these problems. It allows the system to be trained with a small portion of labeled data, which is used for the system's initial training. From that point on, the system is trained with a larger amount of unlabeled data. This data is eventually labeled by the system itself and is used in an iterative process to improve the system.

With any spam-filtering technique, two types of errors always occur to some degree:

- Wrongly classifying spam as ham
- Wrongly classifying ham as spam

Wrongly classifying spam as a legitimate email will likely just inconvenience the user. However, classifying ham as spam could have more negative consequences, such as the loss of important and valuable information.
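The two error types and their unequal costs can be made concrete with a short sketch. Spam is treated as the positive class, so spam passed as ham is a false negative and ham blocked as spam is a false positive. The cost values of 10 and 1 are arbitrary assumptions used only to show how a user-chosen weighting changes the comparison between filters; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def count_errors(actual: Sequence[str], predicted: Sequence[str]) -> Tuple[int, int]:
    """Return (spam classified as ham, ham classified as spam)."""
    pairs = list(zip(actual, predicted))
    false_negatives = sum(1 for a, p in pairs if a == "spam" and p == "ham")
    false_positives = sum(1 for a, p in pairs if a == "ham" and p == "spam")
    return false_negatives, false_positives


def weighted_cost(actual: Sequence[str], predicted: Sequence[str],
                  ham_loss_cost: float = 10.0,
                  spam_miss_cost: float = 1.0) -> float:
    """Total cost when losing a ham message is weighted more than missing spam."""
    false_negatives, false_positives = count_errors(actual, predicted)
    return false_negatives * spam_miss_cost + false_positives * ham_loss_cost


if __name__ == "__main__":
    truth = ["spam", "spam", "ham", "ham"]
    guess = ["ham", "spam", "spam", "ham"]
    print(count_errors(truth, guess), weighted_cost(truth, guess))
```

With equal weights the two mistakes in the example count the same; with the assumed 10-to-1 weighting the single lost ham message dominates the total.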

The solution may be the use of game theory [27]. It could also be one of the two other techniques proposed by Yih et al., which have low false positive rates. However, different types of users have different expectations. In a military situation, information could be vital and therefore the loss of information could have serious consequences, whereas in a personal scenario, the loss of information would likely be less harmful. Therefore, the most reasonable solution is to treat the cost of the two types of errors as a user-defined parameter [28].

The first spam filter based on machine learning techniques was introduced by Sahami et al. [33]. They proposed a Naïve Bayesian classifier which is trained on previously detected spam and ham emails to categorize unseen messages. It performed well on unseen messages, and it led to a rapid increase in the use of machine learning techniques in spam detection. Many different machine learning techniques have the potential to be used in learning-based spam filters. Techniques such as Boosting Tree [31], Support Vector Machine [34], Decision Tree [35], K-nearest Neighbor [34] and Fuzzy Logic [56-61] have already been employed for this purpose with promising results.

Fuzzy Logic Based Spam-filtering Systems

Published results indicate that fuzzy logic is an effective method for spam detection [56-60]. The use of linguistic variables and approximate reasoning makes fuzzy logic well suited to modeling the problem and arriving at a useful answer. Below is a summary of the work of several researchers that effectively utilized fuzzy logic methods in the field of spam detection.

Fuad et al. introduced a trainable type-1 fuzzy spam filter [56]. The filter has a learning period during which it develops an effective rule set, and it then applies the rule set to classify unseen messages. The other positive aspect of their filter is that it was not limited to text. They increased the efficiency of their filter by extracting and considering other aspects of the email from its header, such as an empty sender field, etc.

Kim et al. presented a type-1 fuzzy inference approach as a feature selection method and performed a comparative experiment on the Adult spam category [57]. In terms of spam accuracy, their system's improvement over conventional spam-filtering systems was not very significant. However, other factors, such as the average error rate, spam precision and spam recall, were approximately 6-10% higher than those of conventional systems.

El-Alfy et al. [58] introduced an interesting fuzzy similarity-based spam-filtering method. The method considered the similarity of the content of the message to predict the category of spam instead of relying on a fixed, pre-specified set of keywords. It therefore had the advantage of partially adapting to new techniques and tactics that spammers could develop. El-Alfy et al. created a knowledge base to keep track of these spamming tricks dynamically. They showed that their system can provide better results than a naïve Bayesian classifier.

Hu et al. [59] presented a spam-filtering system that, interestingly, was based on fuzzy clustering instead of fuzzy classification. Their system had no training period and therefore spent no time on training. Their results also showed reasonably good filtering quality. The advantages of their system were high flexibility, reduced need for manual labor, fewer privacy issues and reasonable efficiency.

Meizhen et al. [60] proposed a very interesting behavioral approach to spam detection. Instead of determining whether a message is spam or ham merely from its content, they developed a spam behavior recognition model based on a fuzzy decision tree. Their system analyzed the information from all characteristics of emails and processed it with the fuzzy decision tree. Then, by data globalization, they arrived at potential behavioral features of the emails. Although their results are not very impressive, their approach is novel and notable.

Tahayori et al. [61] introduced an interval type-2 fuzzy set methodology for email classification. They discussed the technique mathematically as a good method for classification, but no results were reported since it was never implemented. It also seems that their approach missed a key part, namely the third dimension of their maps. They also suggested that this technique is not sufficient for a complete classification and that it must be combined with other techniques.

In our proposed system, we use interval type-2 fuzzy sets, rather than type-1 fuzzy sets, to categorize emails. If fully functional, our system will be capable of self-updating through automatic updates of its dictionaries. In other words, it is designed to be adaptive to changes in spam trends over time. We have developed a new way of calculating and assigning the weight of each word in our dictionaries, as described below in the section on weight calculation. To determine whether an email is spam or ham, we develop a 3D fuzzy map for each email, in section 3.6. We then categorize the email by using a data clustering technique on that map. We have tried to develop a powerful spam-filtering system that is capable of recognizing unseen spam and adapting to potential changes in spamming techniques as efficiently as possible.

2.9. Sets: Description and Formalization

Every mathematical concept, from natural to real and rational numbers, can be expressed with set theory [21]. Using set theory accommodates conceptual innovation on the one hand and, on the other hand, makes it easy to represent information of a complex nature. To understand set theory, we first need to understand the concept of the universal set, $X$. The universal set represents the universe of discourse, which contains every possible element that relates in any way to our purpose. An important and commonly used universal set is the set of all points in n-dimensional space, denoted $\mathbb{R}^n$. The relationship of each element to the universal set, or to any subset of the universal set, is shown by the sign $\in$ (belongs to) or the sign $\notin$ (does not belong to). We can define a set in two different ways: by describing the set itself, or by listing its elements. The total number of elements of a set is called the cardinality of the set and is denoted $|A|$. The concept of a subset is shown by the sign $\subseteq$. We can say $A = B$ only if $A \subseteq B$ and $B \subseteq A$. If a set has no elements, we call it an empty set and denote it by $\emptyset$. The complement of a set $A$ is denoted $\overline{A}$.

2.10. Intervals

Intervals are special cases of sets [21]. They are connected subsets of $\mathbb{R}$, which give us a good way of approximating real-life situations. We denote intervals by lowercase letters in brackets, for example $[x]$. They have lower and upper bounds, denoted $\underline{x}$ and $\overline{x}$, and may be closed or not. The width and the center of an interval are calculated as:

\[ \mathrm{width}([x]) = \overline{x} - \underline{x} \tag{1} \]

\[ \mathrm{center}([x]) = \frac{\overline{x} + \underline{x}}{2} = \underline{x} + \frac{\mathrm{width}([x])}{2} \tag{2} \]

2.11. Fuzzy Sets

2.11.1. Description of the Fuzzy Sets Concept

In ordinary set theory, a set has solid boundaries, whether the set is limited or limitless. Fuzzy sets, however, offer a new concept of continuous boundaries. This idea is based on the human perception of processes. In life, we do not use only yes and no when we describe a concept or an idea. Our natural language is the best illustration of the fuzzy set concept. We use words such as small, big, hot, cold, long and short, which describe relative traits that do not have solid boundaries. Although we have no problem using these concepts in real life, they cause considerable trouble when we want to express them in a set-based model. For instance, yes can represent 1, or include, and no can represent 0, or exclude, for the elements. However, we have nothing for concepts such as high pressure or hot temperature. We need some sort of element that can represent these concepts [21]. Zadeh was the first researcher to come up with a solution. He introduced the term fuzzy set, which admits partial membership of elements. We define a fuzzy set by its membership function $A(x)$ or $\mu_A(x)$:

\[ A : X \to [0, 1] \tag{3} \]

$A(x)$ defines the degree of membership of each element $x$ in $A$. If $A(x) = 1$, then $x$ fully belongs to $A$, and if $A(x) = 0$, then $x$ does not belong to $A$ at all. When $A(x)$ is between 0 and 1, $x$ partially belongs to $A$ to the degree given by $A(x)$. The larger the value of $A(x)$, the stronger the degree of association.

2.11.2. Fuzzy Sets Formalization

We can fully describe fuzzy sets by their membership functions. However, we need descriptors to help characterize them. The descriptors of a fuzzy set are height, core and support [21].

Figure 1. Fuzzy set A and its height, core and support

$\mathrm{Hgt}(A)$, the height of $A$, is the supremum of the membership function. If $\mathrm{Hgt}(A) = 1$, we call the fuzzy set normal; in any other case, we call it subnormal. Subnormality is usually the result of dealing with a concept that no element can fully satisfy. $\mathrm{Core}(A)$ is the set of all elements that fully satisfy the fuzzy set, and $\mathrm{Supp}(A)$, the support of $A$, is the set of all elements that satisfy the fuzzy set at least partially.

\[ \mathrm{Supp}(A) = \{x \in X \mid A(x) > 0\} \tag{4} \]

\[ \mathrm{Core}(A) = \{x \in X \mid A(x) = 1\} \tag{5} \]

2.11.3. Main Classes of Membership Functions

The way partial membership is represented depends on the concept, and a suitable membership function should be selected for the application at hand. There are many different ways of representing the membership function, depending on the problem. The most widely used ones are listed below [21].

Triangular Fuzzy Sets:

\[ A(x; a, m, b) = \begin{cases} 0 & \text{if } x \le a \\ \dfrac{x - a}{m - a} & \text{if } x \in [a, m] \\ \dfrac{b - x}{b - m} & \text{if } x \in [m, b] \\ 0 & \text{if } x \ge b \end{cases} \tag{6} \]

The parameters $a$, $m$ and $b$ describe the linear segments of the membership function, which can be rewritten concisely with min and max functions:

\[ A(x; a, m, b) = \max\{\min[(x - a)/(m - a),\ (b - x)/(b - m)],\ 0\} \tag{7} \]

Trapezoidal Fuzzy Sets:

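\[ A(x; a, m, n, b) = \begin{cases} 0 & \text{if } x \le a \\ \dfrac{x - a}{m - a} & \text{if } x \in [a, m] \\ 1 & \text{if } x \in [m, n] \\ \dfrac{b - x}{b - n} & \text{if } x \in [n, b] \\ 0 & \text{if } x \ge b \end{cases} \tag{8} \]

Gaussian Fuzzy Sets:

\[ A(x; m, \sigma) = \exp\left(-\frac{(x - m)^2}{\sigma^2}\right) \tag{9} \]

Non-symmetric Gaussian Fuzzy Sets:

\[ A(x; m, \sigma, \mu) = \begin{cases} \exp\left(-(x - m)^2 / \sigma^2\right) & \text{if } x \le m \\ \exp\left(-(x - m)^2 / \mu^2\right) & \text{if } x \ge m \end{cases} \tag{10} \]

Parabolic Fuzzy Sets:

\[ A(x; m, p) = \begin{cases} 1 - p\,(x - m)^2 & \text{if } x \in [m - 1/\sqrt{p},\ m + 1/\sqrt{p}] \\ 0 & \text{otherwise} \end{cases} \tag{11} \]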
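The first three membership classes above can be sketched in a few lines of Python. This is only an illustration, not code from the thesis: the function names are hypothetical, the triangular and trapezoidal forms use the min/max shorthand of equation (7), the Gaussian uses the $\sigma^2$ denominator of equation (9), and the parameters are assumed to be strictly ordered ($a$ before $m$ before $n$ before $b$) so that no division by zero occurs.

```python
import math

def triangular(x, a, m, b):
    """Triangular membership via the min/max form of eq. (7)."""
    return max(min((x - a) / (m - a), (b - x) / (b - m)), 0.0)

def trapezoidal(x, a, m, n, b):
    """Trapezoidal membership: rises on [a, m], is 1 on [m, n], falls on [n, b]."""
    return max(min((x - a) / (m - a), 1.0, (b - x) / (b - n)), 0.0)

def gaussian(x, m, sigma):
    """Gaussian membership centred at m with spread sigma, as in eq. (9)."""
    return math.exp(-((x - m) ** 2) / sigma ** 2)
```

For example, `triangular(2.5, 0, 5, 10)` evaluates the rising edge halfway between `a` and `m`, and `trapezoidal(3, 0, 2, 4, 6)` falls on the flat top of the trapezoid.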

2.12. Type-2 Fuzzy Sets and Intervals

A fuzzy set of type $n$, $n = 2, 3, \ldots$, was first proposed by Zadeh [41]. The membership function of such a set ranges over fuzzy sets of type $n - 1$, where the membership function of a type-1 fuzzy set ranges over $[0, 1]$. Based on this, a type-2 fuzzy set $\tilde{A}$ is characterized by a membership function:

\[ \mu_{\tilde{A}} : U \to [0, 1]^{J} \]

where the value of $\mu_{\tilde{A}}(u)$ is called a fuzzy grade and is a fuzzy set that ranges over $[0, 1]$ or over a subset $J$ of $[0, 1]$ [42]:

\[ \tilde{A} = \int_{U} \mu_{\tilde{A}}(u) / u = \int_{U} \Big[ \int_{J_u} S_i(u_i) / u_i \Big] \Big/ u, \qquad J_u \subseteq [0, 1],\ 0 \le S_i(u_i),\ i \in I_{J_u} \tag{12} \]

Here, $I_{J_u}$ is the index set of $J_u$, which is consistent with the fact that $\mu_{\tilde{A}}(u)$ is a fuzzy membership grade. Therefore, $u_i$ is the $i$th value of $\mu_{\tilde{A}}(u)$ with the strength of $S_i(u_i)$. $u_i$ forms the primary membership and $S_i(u_i)$ indicates the fuzzy grade, or secondary membership function, of the member $u$. The amount of change of the secondary membership function can also be called the secondary grade. Therefore, $\int_{J_u} S_i(u_i) / u_i$ for any given $u \in U$ is a special type-1 fuzzy set that defines the membership value of $u$ in $\tilde{A}$.

The Domain of Uncertainty (DOU) is the set $\bigcup_{u \in U} \{\, u_i / u \mid S_i(u_i) > 0 \,\}$ [43, 44]. In a type-2 fuzzy set, the DOU does not say much about the strength of each membership degree; it only identifies the region of primary membership degrees. However, where the fuzzy set is continuous with naturally ordered primary values, the domain is called the Footprint of Uncertainty, or FOU. Thus, we shall use FOU and DOU interchangeably. Note that a type-1 fuzzy set is a special instance of a type-2 fuzzy set in which, for all $u \in U$, the set of primary membership degrees $J_u$ is a singleton with the strength of unity.

Having discussed type-2 fuzzy sets, we now introduce a special case, the interval type-2 fuzzy set. If $S_i(u_i) = 1$ for all $u \in U$ and $i \in I_{J_u}$, then the type-2 fuzzy set reduces to an interval type-2 fuzzy set. A discrete interval type-2 fuzzy set is written as:

\[ \tilde{A} = \sum_{u \in U} \Big[ \sum_{u_i \in J_u} 1 / u_i \Big] \Big/ u, \qquad J_u \subseteq [0, 1],\ i \in I_{J_u} \tag{13} \]

Therefore, the footprint of uncertainty is defined as:

\[ \mathrm{FOU}(\tilde{A}) = \bigcup_{u \in U} J_u \tag{14} \]

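2.13. Centroid of a Fuzzy Set

The upper membership function (UMF) and lower membership function (LMF) of an interval type-2 fuzzy set $\tilde{A}$ are themselves type-1 fuzzy sets and represent the upper limit and lower limit of $\mathrm{FOU}(\tilde{A})$, respectively. They are defined as:

\[ \mathrm{UMF}(\tilde{A}) \equiv \overline{\mu}_{\tilde{A}} = \sum_{u \in U} \overline{u} / u, \qquad \overline{u} = \max J_u \tag{15} \]

\[ \mathrm{LMF}(\tilde{A}) \equiv \underline{\mu}_{\tilde{A}} = \sum_{u \in U} \underline{u} / u, \qquad \underline{u} = \min J_u \tag{16} \]

Therefore, we can formulate:

\[ J_u = [\underline{\mu}_{\tilde{A}}(u),\ \overline{\mu}_{\tilde{A}}(u)] \tag{17} \]

\[ \mathrm{FOU}(\tilde{A}) = \bigcup_{u \in U} [\underline{\mu}_{\tilde{A}}(u),\ \overline{\mu}_{\tilde{A}}(u)] \tag{18} \]

It is important to note that the FOU has the prime role in characterizing an interval type-2 fuzzy set. The bounds on the endpoints of the centroid, which serves as the measure of the uncertainty of an interval type-2 set, are defined as follows:

\[ c_{LMF} = \frac{\sum_{u \in U} u\, \underline{\mu}_{\tilde{A}}(u)}{\sum_{u \in U} \underline{\mu}_{\tilde{A}}(u)} \tag{19} \]

\[ c_{UMF} = \frac{\sum_{u \in U} u\, \overline{\mu}_{\tilde{A}}(u)}{\sum_{u \in U} \overline{\mu}_{\tilde{A}}(u)} \tag{20} \]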

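\[ \overline{c}_l = \min(c_{LMF},\ c_{UMF}) \tag{21} \]

\[ \underline{c}_r = \max(c_{LMF},\ c_{UMF}) \tag{22} \]

\[ \underline{c}_l = \overline{c}_l - \frac{\sum_{U} \big(\overline{\mu}_{\tilde{A}}(u) - \underline{\mu}_{\tilde{A}}(u)\big)}{\sum_{U} \overline{\mu}_{\tilde{A}}(u)\ \sum_{U} \underline{\mu}_{\tilde{A}}(u)} \times \frac{\sum_{U} \underline{\mu}_{\tilde{A}}(u)\big(u - \mathrm{Inf}(U)\big)\ \sum_{U} \overline{\mu}_{\tilde{A}}(u)\big(\mathrm{Sup}(U) - u\big)}{\sum_{U} \underline{\mu}_{\tilde{A}}(u)\big(u - \mathrm{Inf}(U)\big) + \sum_{U} \overline{\mu}_{\tilde{A}}(u)\big(\mathrm{Sup}(U) - u\big)} \tag{23} \]

\[ \overline{c}_r = \underline{c}_r + \frac{\sum_{U} \big(\overline{\mu}_{\tilde{A}}(u) - \underline{\mu}_{\tilde{A}}(u)\big)}{\sum_{U} \overline{\mu}_{\tilde{A}}(u)\ \sum_{U} \underline{\mu}_{\tilde{A}}(u)} \times \frac{\sum_{U} \overline{\mu}_{\tilde{A}}(u)\big(u - \mathrm{Inf}(U)\big)\ \sum_{U} \underline{\mu}_{\tilde{A}}(u)\big(\mathrm{Sup}(U) - u\big)}{\sum_{U} \overline{\mu}_{\tilde{A}}(u)\big(u - \mathrm{Inf}(U)\big) + \sum_{U} \underline{\mu}_{\tilde{A}}(u)\big(\mathrm{Sup}(U) - u\big)} \tag{24} \]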
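As a small illustration of equations (19) to (22), the centroids of the LMF and UMF of a discrete interval type-2 fuzzy set, and the two inner bounds derived from them, could be computed as below. This is a sketch under assumptions: the function names are hypothetical, the centroid is taken to be the usual weighted average of a discrete type-1 fuzzy set, and the outer bounds of equations (23) and (24) are deliberately left out.

```python
def centroid(points, memberships):
    """Centroid of a discrete type-1 fuzzy set: sum(u * mu(u)) / sum(mu(u))."""
    total = sum(memberships)
    if total == 0:
        raise ValueError("fuzzy set has empty support")
    return sum(u * mu for u, mu in zip(points, memberships)) / total

def inner_centroid_bounds(points, lmf, umf):
    """Return (min, max) of the LMF and UMF centroids, as in eqs. (21)-(22)."""
    c_lmf = centroid(points, lmf)
    c_umf = centroid(points, umf)
    return min(c_lmf, c_umf), max(c_lmf, c_umf)
```

Here `points` holds the sampled values of the universe $U$, while `lmf` and `umf` hold the lower and upper membership grades at those points.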

3. Chapter 3: Methodology and Experimental Procedure

3.1. System Design Scheme

Figure 2 shows a schematic of our design. We describe each stage of the design to show how it works.

Figure 2. Overall System Architecture

3.2. Building the Dictionaries

A pre-processing procedure needs to be performed at the beginning to build a dictionary for each category of emails. Our design includes one main dictionary for each category of spam and one general white dictionary for all categories. Since we have ten different categories of spam, we need a word dictionary for each of them. In total, we use eleven dictionaries: the ten category dictionaries plus one white dictionary.

The first step in the process is to create a white dictionary, or white list. The purpose of this white list is to eliminate common words that are used in both ham and spam and that are not necessarily a sign of spam. The list includes words that are often found in emails but are not considered indications of spam, such as simple verbs (do, does, etc.) or common words such as you and me. We have identified one hundred words to be considered white words; in other words, our white dictionary has one hundred elements. Though by no means exhaustive, this list suffices for our purposes. This white dictionary is built manually.

For each category of spam, we need a dictionary of the words that are commonly used in that particular category. For instance, the word Sex is kept in the Adult category dictionary and the word Viagra is kept in the Health category dictionary. To build our dictionaries, we use fifty spam emails per category. We calculate the intersection of each single file in a category with the union of the remaining forty-nine files. Finally, we compute the union of these intersections and build the dictionary from this final union:

\[ \bigcup_{i=1}^{n} \Big( A_i \cap \bigcup_{\substack{j=1 \\ j \ne i}}^{n} A_j \Big) \tag{25} \]

for $A_i$, $i = 1, 2, 3, \ldots, n$, where $A_i$ represents the spam files that we used as the baseline for the base dictionaries.

We also tried the simple union of all fifty files in the Adult category and obtained 319 words other than the white words. When we used our formula, however, we obtained 246 words. After carefully reviewing the acquired words, we found that the extra words obtained through the simple union were not a sign of spam in that particular category. Based on this finding, we decided to use the formula for all of the remaining categories. It should be mentioned that we built our dictionaries free of white words.
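Equation (25) keeps exactly the words that occur in at least two of the baseline files, which can be made concrete with Python sets. The sketch below is an illustration only: the function name and the list-of-words input format are assumptions, and the white words are removed before the formula is applied.

```python
def build_category_dictionary(files, white_words):
    """Union over i of (A_i intersected with the union of all other A_j)."""
    white = set(white_words)
    word_sets = [set(words) - white for words in files]
    dictionary = set()
    for i, words_i in enumerate(word_sets):
        # Union of every file except file i.
        others = set().union(*(s for j, s in enumerate(word_sets) if j != i))
        dictionary |= words_i & others
    return dictionary
```

A word that appears in only one file is dropped, which is how the formula yields fewer words than the simple union of all files.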
This means that after we acquired the words from the files and before using the above formula, we checked the words against our white dictionary and eliminated the white words from the list. Then, after using the formula, we constructed the category dictionary from the remaining words.

In our system, each dictionary is a table. Each entry of that table is a structure containing five elements. The first element is the actual word. The second is the maximum number of times this word has been found in any of the last fifty files. The third is the count of the word over all files; it is used to obtain the average, or mean, frequency of the word. The fourth is a list of two numerical elements that holds the upper and lower limits of the weight once the weights have been calculated [see section 3.6.2]. The last element is a list of fifty elements, each of which holds the count of the word in one of the last fifty files.

3.2.1. Weight Calculation

For each word in our dictionary, we have a related weight. To calculate this weight, we keep a buffer of fifty values for each word, where each value is the count of that word in one of the last fifty files we have processed [see section 3.2]. For example, the word free in our dictionary has a buffer containing fifty values. Value number one is the number of instances of the word free found in the first file processed, value number two is the number of instances found in the second file, and so on [see section 3.2]. After gathering the words in our dictionaries, we calculate the weight of each word from the average, or mean, of its frequency and the corresponding standard deviation [52]. We express the weight as an interval:

\[ \text{Weight} = [\text{AVG} - \text{STDEV},\ \text{AVG} + \text{STDEV}] \tag{30} \]

As we go forward with new emails, we discard the oldest value for each word in our buffer and add the new value to the buffer. However, we have a policy in our system that once a word enters our dictionaries, its weight never goes down to zero. If the last fifty emails we have processed contained no instance of that specific word, our system automatically sets the weight of that word to the minimum (0.0001). This policy permits adaptation to changes in spam trends. If spammers stop using a specific word for a period of time, our system sets the minimum weight for that word. However, the word is still considered in our fuzzy classifier, and when spammers start using it again, our system raises the weight based on the average and the standard deviation of the buffer.

We could have adopted a larger buffer, i.e. more than fifty values, for our system. However, there is a tradeoff: the larger the buffer becomes, the slower the system becomes. A small buffer of fifty therefore seems reasonable. We arrived at the number fifty through trial and error.

Standard deviation [52] is calculated as follows. If $X$ is a variable with the mean value $\mu$, then:

\[ \mu = E[X] \tag{31} \]

Here the operator $E$ represents the average value of $X$. The standard deviation of $X$ is therefore:

\[ \sigma = \sqrt{E\big[(X - \mu)^2\big]} \tag{32} \]

When $X$ takes random values from a finite data set $x_1, \ldots, x_N$ with equal probability, the standard deviation $\sigma$ (sigma) is defined as:

\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \tag{33} \]
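The weight interval of equation (30) and the sliding buffer could be sketched as follows. This is a hedged illustration rather than the thesis implementation: the function names are hypothetical, `statistics.pstdev` is used because equation (33) divides by $N$, and representing the minimum weight as the degenerate interval `(0.0001, 0.0001)` is an assumption.

```python
from statistics import mean, pstdev

MIN_WEIGHT = 0.0001  # floor stated in the text

def weight_interval(buffer):
    """Return (AVG - STDEV, AVG + STDEV) for a word's buffer of counts."""
    if not any(buffer):
        # No occurrence in the whole buffer: fall back to the minimum weight.
        return (MIN_WEIGHT, MIN_WEIGHT)
    avg = mean(buffer)
    std = pstdev(buffer)  # population standard deviation (1/N), as in eq. (33)
    return (avg - std, avg + std)

def push_count(buffer, new_count, size=50):
    """Append the newest count and keep only the most recent `size` values."""
    return (buffer + [new_count])[-size:]
```

For instance, a buffer of `[2, 4, 4, 4, 5, 5, 7, 9]` has mean 5 and population standard deviation 2, so its weight interval is `(3.0, 7.0)`.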

It should be mentioned that, since we calculate the weight from the mean and standard deviation, there are cases where the data is quite scattered and the standard deviation is quite large. In these cases, we arrive at a negative lower limit for the weight. To eliminate this issue, we always shift the average by 10 points. In this way, our baseline rises by 10 points and we do not have negative values for either of our limits. This change does not affect our results, since all of the weights are increased by the same 10 points; it merely prevents us from working with negative weights.

3.3. Parsing

The next step is to work on the body of the email. For this, we build a linked list of the text, where each node contains two elements: a string that holds a single word and a number that holds the count of that word in the entire email. We can move both forward and backward on the list. We use a linked list in our implementation because it is easy to work with and sufficient for our purposes. When we produce the linked list for the first time, the count for each word is one, so the list contains many redundant words. For example, we might have forty instances of the word you in an email. We eliminate the redundant words and increase the count of the remaining word later. We also eliminate all the white words but keep track of their number. For example, we record how many instances of the words you or me we eliminated from the linked list.

Since spammers use different techniques to deceive spam filters, we need to be careful in the parsing process. One common technique among spammers is to insert spaces between the letters of a word. To neutralize this technique, whenever we see a one-letter word we immediately check the next word.
If the second word is also a one-letter word, we keep checking the following words until we find a word that is longer than a single letter. We then combine all of these single-letter words and treat them as one word. It should be mentioned here that the two single-letter words, I and a, are already in our white dictionary; however, neither I nor a will be followed by another single letter in an email. Therefore, these two words do not create a problem.

Once this first step is complete, we use a function known as a strip function. This function takes each word and removes endings such as ing or ed, leaving the word's stem. It also checks for any unusual signs in the word. Thus L$o$v$e will be changed to Love. A list of unusual signs is provided in the appendix.

3.4. Checking with all Dictionaries

In this stage, we compare the message with the dictionaries. Before doing so, we eliminate all redundant words so that we have one sample of each existing word together with a number that shows how many instances of that word were found in the email. To do this, we keep the first occurrence of a word, remove every other occurrence from the list, and add one to the count of the first occurrence for each removal. Then we check all remaining words against the white dictionary and exclude, or flag, all white words from the list and thereby from the search. We now have a clean list of words, along with their counts, that is ready to be checked against the category dictionaries. We then look for similarities between the words in the message and the words in the category dictionaries. In our system, any authorized user can activate or deactivate any of the category dictionaries according to his or her preferences. Thus, we check the words only against the active dictionaries. Once that has been carried out, we come to an overall decision based on all checked dictionaries.

For checking similarities, we use the Jaro-Winkler distance technique [48, 49]. Before deciding to use it, we tried the Levenshtein distance [50], the Smith-Waterman distance [51] and the Jaro-Winkler distance techniques.
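The parsing and counting steps described above (merging spaced-out letters, stripping signs and endings, collapsing duplicates into counts, and setting white words aside) can be sketched in Python. This is a minimal sketch under several assumptions that are not in the thesis: a `Counter` stands in for the linked list, every non-letter character is treated as an unusual sign, words are lowercased, and all function names are hypothetical.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "ed")  # the two endings named in the text

def merge_spaced_letters(tokens):
    """Join runs of two or more single-character tokens into one word."""
    merged, run = [], []
    for token in tokens + [""]:  # the empty sentinel flushes the last run
        if len(token) == 1:
            run.append(token)
            continue
        if len(run) >= 2:
            merged.append("".join(run))
        else:
            merged.extend(run)
        run = []
        if token:
            merged.append(token)
    return merged

def strip_word(word):
    """Drop non-letter signs, lowercase, and remove one known ending."""
    word = re.sub(r"[^A-Za-z]", "", word).lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def parse(text, white_words):
    """Return (counts of non-white words, number of white words removed)."""
    words = [strip_word(t) for t in merge_spaced_letters(text.split())]
    words = [w for w in words if w]
    removed = sum(1 for w in words if w in white_words)
    counts = Counter(w for w in words if w not in white_words)
    return counts, removed
```

The suffix rule is deliberately naive (it only guards against very short words), so a real stemmer would behave differently on words such as "freed".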
Based on the accuracy of the results we observed with each technique, the Jaro-Winkler distance proved to be the most effective of the three. Thus we determined it to be best suited to our purposes. We define the Jaro distance $d_j$ of two strings $s_1$ and $s_2$, where $|s|$ denotes the length of string $s$, as follows:

\[ d_j = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) \tag{26} \]

where $m$ is the number of matching characters and $t$ is the number of transpositions. Two identical characters from $s_1$ and $s_2$ are considered matching if their positions are not farther apart than:

\[ \left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1 \tag{27} \]

The number of transpositions is the number of matching characters that appear in a different order in the two strings, divided by two. With $m'$ denoting that number of out-of-order matching characters, it is defined as:

\[ t = \frac{m'}{2} \tag{28} \]

The Jaro-Winkler distance $d_w$ of any two strings $s_1$ and $s_2$ is defined as:

\[ d_w = d_j + l\,p\,(1 - d_j) \tag{29} \]

where:

$d_j$ : the Jaro distance of strings $s_1$ and $s_2$.
$l$ : the length of the common prefix at the start of the strings, up to a maximum of 4 characters.
$p$ : a constant scaling factor for how much the score is adjusted. $p$ should not be more than 0.25, or the distance can become larger than 1. The standard value in Winkler's paper is $p = 0.1$ [49].
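Equations (26) to (29) translate into a short Python sketch. It is offered as an illustration, not as the code used in the thesis, and the function names are hypothetical. The value returned is the similarity score $d_w$; since the text treats a hit word as having 100% similarity and zero distance, the distance used elsewhere is assumed here to be `1 - jaro_winkler(s1, s2)`.

```python
def jaro(s1, s2):
    """Jaro score d_j of two strings, in [0, 1] (eq. 26)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)  # matching range of eq. (27)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, j = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order / 2  # eq. (28)
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler score d_w (eq. 29) with the prefix length capped at 4."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)
```

For the classic pair "MARTHA" and "MARHTA", all six characters match, two of them out of order, so $t = 1$ and $d_j = 17/18$; the shared three-letter prefix then lifts $d_w$ to about 0.961.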

Using the Jaro-Winkler technique in all comparisons, we first check our linked list against our dictionaries to find any similar patterns. We call words hit words if the similarity is 100% and the distance between them is zero. Furthermore, we refer to all words that have a similarity of more than 80%, that is, a distance of less than 0.2 out of 1, as similar words. If the similarity is less than 80%, we do not name them; in these cases, we simply record the distance. We keep track of all similar, hit and white words along with their associated weights, which are determined by our dictionaries, and their associated distances.

3.5. Building Fuzzy Maps

Fuzzy maps are built from information that we have already extracted from the email. We use an interval type-2 fuzzy paradigm to decide whether the email is ham or spam. We build our fuzzy maps from the distance of each word in the email to our dictionaries, the weight of the closest dictionary entries to the word [see section 3.2.1] and the frequency of use of each word in the email (TF-IDF). We use a three-dimensional (10,000 x 10,000 x 1) matrix to build the map. The first and second dimensions of the map, or matrix, hold the distance (10,000 values) and the weight (10,000 values), and the third dimension holds the TF-IDF value (only 1). Since the matrix has only 10,000 values for each of the distance and weight vectors, every value that exceeds the limit of 4 digits must be rounded to the closest value on the map. For example, a distance with more than four digits is rounded to the nearest four-digit value, such as 0.3455. The same is true for the weights.

First, we calculate or extract each distance (section 3.5.1), weight (section 3.2.1) and weight of each interval (section 3.5.2). Second, we put each weight interval on the map at its proper distance value. Third, we associate the third dimension of the map, for all points under the actual weight interval, with the weight of that specific interval. Figure 3 shows a sample of the map.
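The three map-building steps can be illustrated with a sparse dictionary in place of the full 10,000 x 10,000 matrix. This is a sketch under assumptions: the function names and the `{(distance_tick, weight_tick): tfidf}` layout are hypothetical, values are rounded to four decimals as described, and where two words fall on the same point the larger TF-IDF is kept, following the rule the thesis gives for identical weight and distance.

```python
SCALE = 10_000  # four decimal digits per axis, as in the text

def tick(value):
    """Round a value to four decimals and return it as an integer grid tick."""
    return int(round(value * SCALE))

def add_interval(fuzzy_map, distance, weight_low, weight_high, tfidf):
    """Fill every grid point under a word's weight interval with its TF-IDF.

    fuzzy_map is a sparse dict {(distance_tick, weight_tick): tfidf}.
    """
    d = tick(distance)
    for w in range(tick(weight_low), tick(weight_high) + 1):
        key = (d, w)
        fuzzy_map[key] = max(fuzzy_map.get(key, 0.0), tfidf)
    return fuzzy_map
```

A sparse structure is used here only because a dense array of this size would be impractical for a short example; the thesis itself describes a matrix.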

Figure 3. A Sample of the 3D Fuzzy Map

3.5.1. Distance Calculation

To calculate the distance of a word to a dictionary, we first calculate the distance between the word and every dictionary entry using the Jaro-Winkler technique [48, 49] and keep the closest entry. If more than one entry in the dictionary has the same distance to the searched word, we choose the maximum, or union, weight of those matched entries. We set the maximum distance (1) for all white words.

3.5.2. Weight of Each Interval in the Map (Frequency of Each Word)

After building the first and second dimensions of the map from the distance and weight of each word, we turn to the third dimension, which is the frequency of use of each word in the email. We need a specific weight for each word's interval in the map that represents the frequency of that word in the current email. This weight is different from the weight of the word in our dictionary. We use it as a statistical measure to evaluate the importance of each word to the whole email. This importance increases proportionally to the number of times that the word is used in the email. Here, we employ the TF-IDF (term frequency-inverse document frequency) technique [53] to calculate the specific weight for each interval in the third dimension of the map. We define the term frequency $tf_{i,j}$ as [53]:

\[ tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{34} \]

Here $n_{i,j}$ denotes the number of occurrences of the word $t_i$ in the text, or document, $d_j$, and the denominator is the sum of the number of occurrences of all words in the document $d_j$. The inverse document frequency is obtained by dividing the total number of documents by the number of documents that actually contain the word and then taking the logarithm of the quotient. It is essentially a measure of how important a specific word is in general. It is defined as [53]:

\[ idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \tag{35} \]

where $|D|$ is the total number of texts (documents) in the corpus and $|\{d : t_i \in d\}|$ is the number of documents in which the word $t_i$ appears (that is, $n_{i,j} \neq 0$). If the word is not in the corpus, this leads to a division by zero. It is therefore common to use $1 + |\{d : t_i \in d\}|$ instead. We then have:

\[ (tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i \tag{36} \]

Since the weight of each word in the dictionary is itself an interval, it also appears as an interval in our map. Therefore, after calculating the TF-IDF for each word, we need to associate all members of that interval with that specific TF-IDF. For example, suppose the word lovers appears in the map with the distance 0.1314 and a weight interval that starts at 0.2118. Since the weight of the word is an interval, it covers 364 points of the map. The TF-IDF, or weight of the interval, should therefore be applied to all 364 points that this interval covers. In the third dimension of the map, we fill all 364 points under that interval with the same TF-IDF. If another word in the map has an identical weight and distance (e.g. a weight interval that also starts at 0.1300 at the same distance), we disregard the lesser TF-IDF and put the larger one in the third dimension. After applying the TF-IDF weights to each related interval in the map, we are ready to calculate the centroid of the map.
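Equations (34) to (36) can be sketched directly in Python. The function names and the representation of each document as a list of words are assumptions made for illustration; the `1 +` form of the denominator is used so that a word absent from the corpus does not cause a division by zero.

```python
import math
from collections import Counter

def tf(word, document):
    """Term frequency: occurrences of word divided by all words in document."""
    return Counter(document)[word] / len(document)

def idf(word, corpus):
    """Inverse document frequency with the 1 + count denominator."""
    containing = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / (1 + containing))

def tf_idf(word, document, corpus):
    """Product of term frequency and inverse document frequency (eq. 36)."""
    return tf(word, document) * idf(word, corpus)
```

One side effect of the `1 +` denominator is that a word present in every document receives a slightly negative score, since the quotient inside the logarithm falls below 1.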

3.6. Evaluation and Prediction

To decide whether an email is spam, after building its interval type-2 fuzzy map we calculate its centroid [see section 2.13]. For this calculation, we treat the map as two-dimensional, because the centroid formula does not support three-dimensional maps. We begin by calculating the centroid of our 2D map and then take the third dimension into account, as explained in greater detail further on in this section. The centroid formula gives us four values [see section 2.13], which are the uncertain bounds of the left and right endpoints. For simplicity, we consider the centroid as a horizontal interval in our map that has no clear end points. Instead, each pair of the above values forms an interval that locates one end point of that larger interval.

We divide the centroid's domain, which is essentially our distance vector, into three zones. The first zone represents the spam area and the third zone represents the ham area. The second zone is our uncertain area. Figure 4 shows a schematic of the fuzzy subsets. We arrived at these fuzzy subsets through trial and error. Based on our experiments, words that were never seen before usually have distances of more than 0.6, and most of the words that were seen before have a distance of less than 0.4. Words with a distance between these two values are rare, regardless of whether or not they are spam words. The distribution of the intervals in our maps for several different emails led us to these three zones. Although in our experiments no email ended up in the uncertain zone, we decided to keep it since we might encounter such emails in the future. Because we had only a limited number of spam emails to test, we thought it best to retain the uncertain zone.
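As a rough illustration of the three zones, the helper below reports which zones a centroid interval overlaps. It is a simplification built on assumptions: the thesis defines the zones as fuzzy subsets, whereas this sketch uses crisp cut-offs at the 0.4 and 0.6 distances mentioned above, and the function and constant names are hypothetical.

```python
SPAM_LIMIT = 0.4  # assumed upper edge of the spam zone
HAM_LIMIT = 0.6   # assumed lower edge of the ham zone

def zones_covered(left, right):
    """Return the zones that the centroid interval [left, right] overlaps."""
    zones = []
    if left < SPAM_LIMIT:
        zones.append("spam")
    if right >= SPAM_LIMIT and left <= HAM_LIMIT:
        zones.append("uncertain")
    if right > HAM_LIMIT:
        zones.append("ham")
    return zones
```

When the result contains more than one zone, a further step is needed to pick a single zone for the email.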

Figure 4. A Schematic of the Membership Function of the System

If the centroid of an email ends up in the second zone, our system does not categorize the email and leaves the decision to the user. If the user decides that the email is spam, our system improves itself by adding the new words of that email and changing the weights of existing words [see section 3.2.1]. However, in most cases the centroids are large and cover more than one zone. Specifically, since the end points of the centroid are intervals with uncertain boundaries, we have had cases where even the end points lie in two different zones. It is then very difficult to decide which zone a centroid belongs to. In addition, the data elements associated with the centroid in the third dimension of the map can belong to more than one zone. We need a sophisticated technique to separate these associated data sets so that we can decide which zone they belong to. In cases where the centroid ends up in a single zone, we can decide more easily. Challenges arise when we have to decide between two zones. The third-dimension values are our guide, or criteria, for determining which zone the centroid belongs in. Furthermore, the number of points, or the length, that the centroid has in each zone is also a guide. To make this decision, we employ the fuzzy c-means (FCM) clustering technique [54, 55].

In fuzzy clustering, data elements can belong to more than one cluster, and each element has a set of membership levels that indicate the strength of its association with each cluster. The process of assigning each data element to one or more clusters by considering these membership levels is called fuzzy clustering. First, we define three clusters, one for each zone ($j$). Second, we give each point ($x$) a degree of belonging to each cluster ($u_{ij}$). Third, we repeat the algorithm until it covers all points of the centroid ($i$). Fourth, using the formula below, we calculate the centre of each cluster ($c$). The next step is to decide which cluster each point belongs to. This is done by minimizing the following function:

\[ J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m}\, \lVert x_i - c_j \rVert^2, \qquad m \in [1, \infty) \tag{37} \]

where:

$m$ : any real number greater than 1
$u_{ij}$ : the degree of membership of $x_i$ in the cluster $j$
$x_i$ : the $i$th of the n-dimensional measured data
$c_j$ : the n-dimensional center of the cluster
$\lVert * \rVert$ : any norm expressing the similarity between a measured datum and the centre.

We perform the fuzzy partitioning through an iterative optimization of the objective function above, updating the memberships $u_{ij}$ and the cluster centers $c_j$ by:

\[ u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \tag{38} \]

\[ c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}} \tag{39} \]

The iteration stops when $\varepsilon > \max_{ij} \big| u_{ij}^{(k+1)} - u_{ij}^{(k)} \big|$, where $\varepsilon$ is a termination criterion between 0 and 1 and $k$ counts the iteration steps. This procedure converges to a local minimum or a saddle point of $J_m$ [54, 55]. After using the FCM technique, we can finally determine the correct zone for each centroid based on those clusters.

3.7. Dictionary Improvement

Spammers always come up with methods for breaking through spam-filtering techniques. An effective anti-spam system needs to be adaptive in order to combat new spamming techniques. Our system has an adaptive nature. It has the capacity to adapt to new words that it has never seen before. If spam trends change over time and spammers try to employ new methods of using words, our system can keep up with these changes by improving its dictionaries regularly and consistently. We have two different approaches to dictionary improvement:

Email is predicted to be ham: In this case, the weight of the hit words is reduced automatically by applying their count to the buffer in a negative way. When we put the number of instances of the hit word into the buffer with a negative sign, it affects the average, or mean, and therefore the standard deviation [see section 3.6.2].

Email is predicted to be spam: In this case, the weight of the hit words is raised automatically [see section 3.6.2]. In addition, after excluding all white words and hit words that we have identified in the email, we add all the remaining words of the email to our dictionaries. The weight of the newly added words is very low at first, owing to their average and standard deviation. However, their weight will increase over time if we get more hits for them in the future.

After dictionary improvement, our system's objectives are nearly fulfilled. The system then puts the email in the inbox folder if it is ham or in the spam folder if it is spam.

Results and Analysis

Our system uses the same method to detect spam in all ten categories of spam. Here, we perform an experiment to demonstrate that our system works as intended. In our experimental evaluation, we used a total of 1895 emails, including 567 emails containing spam content and 1328 legitimate emails. Those 567 emails belong to almost all types of spam, but mostly to the Adult type. We also employed the confusion matrix, or contingency table, to evaluate our system [56]. In a contingency table, true positive (TP) denotes the correct classification of a spam email, whereas false positive (FP) denotes a ham email incorrectly classified as spam. True negative (TN) denotes the correct classification of a ham email and false negative (FN) denotes a spam email incorrectly classified as ham. Table 1 shows our contingency table.

                           Predicted by Our System
                           Positive      Negative
Real Emails   Positive     TP = 441      FN = 126
              Negative     FP = 82       TN = 1246

Real Emails = 1895, Uncertain Emails = 0

Table 1. Contingency Table

Accuracy ($A_{cc}$) is the most important evaluator of any spam detection system. This measure evaluates the proportion of correctly classified instances, whether ham or spam. It is also the general measure of effectiveness of any spam detection system [56]:

$$A_{cc} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (40)$$

Two other measures are equally important in assessing the effectiveness of a spam detection system: Spam Precision ($S_p$), which denotes the proportion of messages classified as spam that truly are spam, and Spam Recall ($S_r$), which denotes the proportion of real spam messages that the system correctly categorizes as spam [56]:

$$S_p = \frac{TP}{TP + FP} \qquad (41)$$

$$S_r = \frac{TP}{TP + FN} \qquad (42)$$

Table 2 shows the results of our system. Since researchers are more concerned about hams being incorrectly blocked than they are about allowing a spam to pass through, spam precision is the most important factor in spam filtering. As we can see here, our system has very good potential to respond to this concern [compare with 56, 57, 58, 59 and 60].

Total Emails   Actual Spam   Actual Ham   Accuracy   Spam Precision   Spam Recall
1895           567           1328         %          84.3 %           77.7 %

Table 2. Main System Results

As we know, there is no such thing as a spam benchmark for testing purposes. Spammers change the trend of their spam day by day, and almost no two spam emails are 100% similar, at least not in the case of plain-text spam, which is what our research is concerned with. Therefore, the results that people present for their work in this field may vary dramatically if they use a different data set. The results we have here are based solely on the testing data set that we could provide.

Since we had no access to the source code of other text-based spam detection systems with which to test our data set, we decided to test it against two major mail servers: Yahoo Mail Server and Microsoft Mail Server, or Hotmail. However, we do know that this test is not one hundred percent accurate. These mail servers use different spam detection methods on many different levels. We therefore tried to simplify the test as much as possible so that it is as close as possible to the plain-text level. We did not use Gmail, the Google Mail Server, because we acquired all our spam emails from Gmail. Since Gmail had already detected these emails as spam, there was no reason to use it again.

Based on our past experience, after catching some spam from a single sending account, Yahoo blocks the sending account and starts to send most of the emails from that particular account to the spam folder, including the legitimate emails. This shows that Yahoo identifies the sending account and simply blocks it. The same occurs with Hotmail. The only difference between the two servers is the amount of spam that each considers critical in order to block the sending account. We do not know what that critical number is or whether other factors are involved in the process. On the receiving side, a single account is enough. We used a legitimate receiving account on each of those mail servers, each of which has been in use for a long time.

In order to be as accurate as possible for the sending accounts, we needed to divide our data set into smaller fractions. Each new data set, or fraction, contained a smaller portion of spam and a larger portion of ham. Also, for the purpose of accuracy, we attempted to make each new data set as similar to the others as possible in terms of the number of ham versus spam emails and the category of spam. In the end, we had twenty new data sets in total. We then created twenty different sending accounts on each server and sent those data sets from these sending accounts to the one receiving account on each mail server. The results are shown below.


                           Predicted by Hotmail
                           Positive      Negative
Real Emails   Positive     TP = 389      FN = 178
              Negative     FP = 213      TN = 1115

Real Emails = 1895

Table 3. Hotmail Contingency Table

Figure 5. Hotmail Snapshot

Total Emails   Actual Spam   Actual Ham   Accuracy   Spam Precision   Spam Recall
1895           567           1328         %          64.4 %           68.6 %

Table 4. Hotmail Results

                           Predicted by Yahoo
                           Positive      Negative
Real Emails   Positive     TP = 428      FN = 139
              Negative     FP = 317      TN = 1011

Real Emails = 1895

Table 5. Yahoo Contingency Table

Figure 6. Yahoo Snapshot

Total Emails   Actual Spam   Actual Ham   Accuracy   Spam Precision   Spam Recall
1895           567           1328         %          57.4 %           75.4 %

Table 6. Yahoo Results
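As a cross-check on the three contingency tables, the measures defined in Equations (40) to (42) can be recomputed directly from the TP, FN, FP and TN counts. The following is a minimal sketch in Python, not the code used in our experiments; the class name `Contingency` and the dictionary `SYSTEMS` are hypothetical names introduced only for this illustration, and the counts are copied from Tables 1, 3 and 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """Counts of one contingency table (hypothetical helper for illustration)."""
    tp: int  # spam correctly classified as spam
    fn: int  # spam incorrectly classified as ham
    fp: int  # ham incorrectly classified as spam
    tn: int  # ham correctly classified as ham

    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn

    def accuracy(self) -> float:
        # Equation (40): (TP + TN) / (TP + TN + FP + FN)
        return (self.tp + self.tn) / self.total()

    def spam_precision(self) -> float:
        # Equation (41): TP / (TP + FP)
        return self.tp / (self.tp + self.fp)

    def spam_recall(self) -> float:
        # Equation (42): TP / (TP + FN)
        return self.tp / (self.tp + self.fn)


# Counts copied from the contingency tables (Tables 1, 3 and 5).
SYSTEMS = {
    "Our system": Contingency(tp=441, fn=126, fp=82, tn=1246),
    "Hotmail": Contingency(tp=389, fn=178, fp=213, tn=1115),
    "Yahoo": Contingency(tp=428, fn=139, fp=317, tn=1011),
}

if __name__ == "__main__":
    for name, table in SYSTEMS.items():
        print(
            f"{name:<11}"
            f" A_cc = {100 * table.accuracy():5.1f} %"
            f"  S_p = {100 * table.spam_precision():5.1f} %"
            f"  S_r = {100 * table.spam_recall():5.1f} %"
        )
```

Values recomputed in this way need not agree with the tabulated percentages in the last digit, since the sketch rounds to one decimal place and the tables may have been rounded differently.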
