Chapter-8. Conclusion and Future Scope

This thesis has addressed the problem of Spam E-mails by proposing a framework that rests on three pillars: Legislative measures, Behavioural measures and Technical measures. The three pillars are equally important in the fight against Spam E-mails. The conclusions drawn from studying each of them have been incorporated into the proposed framework for Spam management.

The first three sections of this chapter summarize the findings for the Legislative, Behavioural and Technological measures in turn. The last section focuses on directions for future research.

8.1 Legislative Measures

The study of legislative measures examined the legislative mechanisms currently implemented around the world to fight Spam E-mails. The parameters considered were the type of subscription, the scope of the subscription, the type of sender and receiver, and the group of possible accusers. India has implemented neither an Anti-Spam law nor a general identity theft law, but relevant provisions have been made in the criminal law, including provisions for reporting identity theft and related issues. To address cyber security and privacy issues, several amendments have been made to the Information Technology Act 2000 (IT Act 2000), which was notified on 17th October, 2000 by the Indian Parliament. India needs separate Anti-Spam legislation.

The summary of the study carried out on Legislative measures is as follows:

- Only a few countries have enacted Spam legislation, which also includes identity theft legislation. Traditional provisions covering fraud, forgery and cybercrime have also been made. India needs separate Anti-Spam legislation.

- Different countries have different legislation with a variety of options, and the methods of investigation and prosecution also vary. This variation can lead to situations in which the investigation process of one country is blocked by another country, so there is a need for homogeneous legislation on Spam E-mail all over the world.
- Reporting mechanisms are lacking. Only a few countries provide reporting mechanisms, whether online or offline. It is advisable that each country establish at least one online reporting mechanism through which samples of Spam E-mails and incidents of Spamming can be reported. In India, only two metro cities, Mumbai and Bangalore, have an online mechanism for reporting identity theft, and it does not cover Spamming.
- Users should be aware of these reporting mechanisms, and of the punishments provided under Anti-Spam law, for the law to be implemented effectively.
- The reporting mechanisms should also inform victims about follow-up and the action taken so far on the complaints they have registered.
- The list of Spammers who have been punished for Spamming should be widely publicized.
- The reporting mechanisms would become a useful data collection tool, helping the Content based Filter to follow the current pattern of Spam E-mails so that it can be updated.

8.2 Behavioural Measures

The study of behavioural measures was carried out to find behavioural patterns that may be common in the sending of Spam E-mails. These patterns proved useful as a foundation for the technological measures and the proposed Anti-Spam solution. The E-mail delivery pattern was studied, and the content of both the header and the body was analysed. This content analysis played an important role in the Content based Filter proposed under the technological measures.
The findings of the behavioural study, summarized below, proved very useful when suggesting the technological solution to block Spam E-mail.

- An E-mail in which almost all words of the subject field, the body, or both are in uppercase was found to be Spam.
- An E-mail with an empty subject field was found to be Spam.
- An E-mail with different domain names in the From field and the Reply-To field is Spam.
- Some Spam E-mails contain many E-mail addresses in the To field; the presence of many addresses in the To field indicates Spam.
- Many Spam E-mails contain no E-mail address in the To field; the address is generally added to the CC or BCC field instead.
- Spam E-mails contain no information in the Return-Path field.
- E-mails whose From field contains typical words, or combinations of them, such as NOKIA MOBILE LOTTREY DRAW, Promo Enlargement, BBC NATIONAL LOTTERY, UNITED KINGDOM LOTTERY, COCA COLA DRAW and Free Trial Men's Supplement, are Spam.
- Some words were identified whose presence, alone or in combination, increases the chance of an E-mail being Spam. These words are WON POUNDS, job offers, UK-LOTTERY, huge stick, increase your length, desired proportion and size, Customer Survey, WON 500,000GBP, LOAN OFFER!!, WINNING NOTIFICATION..!!, making money, income going down, LOTTERY DRAW, Weight Loss, Diet, WON 750,000.GBP, SEX PILL, Buy Viagra at Half Price, Winner, MyDailyFlog!, HasDonated (,,500,000.GBP) etc.
- Some Spammers intentionally break up or misspell Spam words in order to bypass filtering mechanisms.

8.3 Technological Measures

To propose a technological solution for blocking Spam E-mail, existing solutions were implemented first. The proposed Anti-Spam Framework combines Origin based Filters with Content based Filters. Origin based Filters such as White-list and Black-list were implemented. The Challenge-Response (C-R) System, which is used to differentiate between humans and machines, was also implemented, and the drawbacks of the C-R System are addressed by the proposed architectures.
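
The Origin based Filter and the header observations of Section 8.2 can be sketched together as follows. This is a minimal illustration, not the implementation used in the thesis: the function names, the sample White-list and Black-list entries, the thresholds (80% uppercase words, more than 10 recipients) and the rule that any single flag marks a message as Spam are all assumptions chosen for the example.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

# Hypothetical origin lists; a real deployment would load these from storage.
WHITE_LIST = {"colleague@example.org"}
BLACK_LIST = {"promo@lottery.example"}

MAX_RECIPIENTS = 10      # assumed threshold for "many" To addresses
UPPERCASE_RATIO = 0.8    # assumed threshold for "almost all words"


def domain_of(header_value):
    """Return the lower-cased domain of the first address in a header value."""
    address = parseaddr(header_value or "")[1]
    return address.rpartition("@")[2].lower()


def mostly_uppercase(text):
    """True if almost all words containing letters are written in uppercase."""
    words = [w for w in text.split() if any(c.isalpha() for c in w)]
    if not words:
        return False
    upper = sum(1 for w in words if w.isupper())
    return upper / len(words) >= UPPERCASE_RATIO


def header_flags(raw_message):
    """List the Section 8.2 header observations that a message matches."""
    msg = message_from_string(raw_message)
    subject = (msg.get("Subject") or "").strip()
    to_addresses = [addr for _, addr in getaddresses(msg.get_all("To", [])) if addr]
    from_domain = domain_of(msg.get("From"))
    reply_domain = domain_of(msg.get("Reply-To"))

    flags = []
    if not subject:
        flags.append("empty subject")
    elif mostly_uppercase(subject):
        flags.append("uppercase subject")
    if reply_domain and from_domain != reply_domain:
        flags.append("From/Reply-To domain mismatch")
    if not to_addresses:
        flags.append("no To address")
    elif len(to_addresses) > MAX_RECIPIENTS:
        flags.append("many To addresses")
    if not (msg.get("Return-Path") or "").strip():
        flags.append("missing Return-Path")
    return flags


def classify(raw_message):
    """Apply the origin lists first, then fall back to the header checks."""
    sender = parseaddr(message_from_string(raw_message).get("From", ""))[1].lower()
    if sender in WHITE_LIST:
        return "ham"
    if sender in BLACK_LIST:
        return "spam"
    # Assumption: any single flag is enough to call the message Spam.
    return "spam" if header_flags(raw_message) else "ham"
```

In a combined framework the flags would more plausibly be passed to the Content based Filter as features than used as a verdict on their own; the all-or-nothing rule here only keeps the sketch short.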

After the content of Spam E-mails was studied under the behavioural measures, feature extraction was applied to the standard datasets Enron, LingSpam, PU123A and PEM, and important features were extracted on the basis of the observed patterns. Content based Filters were implemented using machine learning (ML) based classifiers and using semantic similarity with an edge based classifier. The ML based classifiers implemented include Decision Tree, Rough Sets, k-nearest Neighbor (k-nn) and Support Vector Machine (SVM). The Rough Set classifier was implemented with various rule generation methods: Genetic Algorithm, Learn by Example Method (LEM), Covering Algorithm and Exhaustive Algorithm. The SVM was implemented with various kernel functions, such as Linear Kernel, Multi Layer Perceptron, Quadratic Function and Radial Basis Function. These classifiers were run on the features extracted from the standard datasets Enron, LingSpam, PU123A and Spambase, and from PEM. The frequency of occurrence is a meaningful attribute that was added to the extracted features, and it contributed to the improvement in results.

The overall performance of the SVM with a polynomial kernel is high for the PU123A, LingSpam and PEM datasets. The polynomial kernel has degree three, and there are two classification categories (Spam and Ham). The hyperplane formed by the polynomial-kernel SVM performs binary classification; since the input data is linearly separable, the results achieved are promising. The empirical analysis found that accurate feature extraction reduced the gap between the low-level and high-level features of an E-mail. Thus the accuracy of Spam filtering improved and Spam misclassification was reduced.

The empirical analysis of the ML based classifiers shows that the Naive Bayesian classifier is a suitable classifier for a dataset like Enron, while Rough Set with Genetic Algorithm is suitable for the Spambase dataset.
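
For reference, a degree-three polynomial kernel and the two-class decision rule built on it can be sketched as below. The kernel form (gamma * <x, y> + coef0) ** degree is a common parameterisation; the thesis states only the degree (three) and the two classes, so gamma, coef0, the support vectors, the coefficients and the bias used here are illustrative assumptions and not values from the experiments.

```python
def poly_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel K(x, y) = (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def svm_decide(x, support_vectors, dual_coefs, bias):
    """Kernel SVM decision rule: the sign of sum_i coef_i * K(sv_i, x) + bias.

    Each dual coefficient stands for alpha_i * y_i, with y_i = +1 for Spam
    and y_i = -1 for Ham.
    """
    score = sum(c * poly_kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs))
    return "Spam" if score + bias >= 0 else "Ham"


# Made-up two-feature example (e.g. counts of two Spam words); not thesis data.
support_vectors = [(3.0, 2.0), (0.0, 1.0)]
dual_coefs = [0.01, -0.5]   # hypothetical alpha_i * y_i values
bias = 0.0

print(svm_decide((4.0, 3.0), support_vectors, dual_coefs, bias))
print(svm_decide((0.0, 0.0), support_vectors, dual_coefs, bias))
```

With these made-up values the first message falls on the Spam side of the boundary and the second on the Ham side. The boundary is a hyperplane in the feature space induced by the kernel, which is what lets a degree-three kernel draw a curved boundary in the original feature space.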
The SVM with polynomial kernel outperforms the other classifiers on datasets like LingSpam, PU123A and PEM. The experimental results show that an ML based classifier is both an effective and an efficient Anti-Spam filter.

The Content based Filter using semantic similarity with an edge based classifier was implemented with the intent of improving on the results of the machine learning classifiers, and its results were compared with theirs. Table-7.8 shows that the semantic similarity with edge based approach outperforms the ML based classifiers, with misclassification (false positives and false negatives) almost zero.
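
To illustrate what an edge based semantic similarity measure computes, the sketch below counts the edges on the shortest path between two words in a small word graph and converts that distance into a score. The graph, the word list, the 1 / (1 + distance) scoring and the function names are assumptions made for this illustration; the measure and lexical resource actually used in the thesis may differ.

```python
from collections import deque

# Hypothetical synonym/relatedness links; a real system would use a lexical
# resource rather than a hand-written graph.
EDGES = [
    ("lottery", "draw"), ("lottery", "prize"), ("prize", "award"),
    ("award", "winnings"), ("loan", "credit"),
]

GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)


def path_length(source, target):
    """Number of edges on the shortest path, or None if the words are unconnected."""
    if source not in GRAPH or target not in GRAPH:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, dist = queue.popleft()
        if word == target:
            return dist
        for neighbour in GRAPH[word]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def edge_similarity(w1, w2):
    """Score in [0, 1]: identical words score 1.0, unconnected words 0.0."""
    dist = path_length(w1, w2)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)
```

Because the score depends only on the graph, a message containing "winnings" can still be related to the known Spam word "lottery" even if that exact word never appeared in a training corpus.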

The Content based Filter is made adaptive so that its accuracy improves over time. It has been shown that semantic relationships, specifically synonyms, play an important role in Spam classification. The semantic similarity with edge based classifier has the advantage that it does not depend on the corpora. Its experimental results outperform those of the earlier machine learning based classifiers, and it has also reduced misclassification.

The overall analysis shows that Naive Bayesian, SVM with Polynomial Kernel and the Semantic Similarity with Edge based approach are promising techniques that can be applied to fight the problem of Spam E-mails. Finally, the combination of the Origin based Filter with the Content based Filter would produce the optimal results. The results clearly demonstrate that, because it is adaptive, the proposed Anti-Spam Framework can effectively filter Spam E-mail with very little misclassification (100% classification being impossible).

8.4 Future Scope

Although this thesis has made efforts towards solving the problem of Spam E-mail using legislative, behavioural and technological measures, the solutions proposed are not complete. Spam and Anti-Spam solutions are a game of cat and mouse, since every day Spammers come up with new techniques for sending Spam E-mails. This work has given a potential direction for the classification of Spam E-mails. Future efforts would be directed towards:

- Achieving accurate classification, with zero percent (0%) misclassification of Ham E-mail as Spam and of Spam E-mail as Ham.
- Blocking Phishing E-mails, which carry phishing attacks and are nowadays a greater matter of concern.
- Extending the work to guard against Denial of Service (DoS) attacks, which have now emerged in a distributed form known as Distributed Denial of Service (DDoS) attacks.