Information Extraction from Spam Emails using Stylistic and Semantic Features to Identify Spammers


Soma Halder, University of Alabama at Birmingham
Richa Tiwari, University of Alabama at Birmingham
Alan Sprague, University of Alabama at Birmingham

Abstract

Traditional anti-spamming methods filter spam emails and prevent them from entering the inbox, but take no measures to trace spammers and penalize them. This paper uses Natural Language Processing and Machine Learning techniques to cluster the spam emails from the same spammer based on the content and the style of the spam. Spam emails from different sources and from varied periods of time are studied with two sets of features: stylistic and semantic. Three sets of clustering are performed: clustering based on stylistic features, clustering based on semantic features and clustering based on combined features. The clusters formed by different conventional clustering algorithms are compared and evaluated. Spam emails from the same source have similarities and cluster together. Spam emails contain URLs of the web pages that the spammer is trying to promote. Clusters are mapped to the Internet Protocol (IP) addresses of these URLs, and the WHOIS information of the IP addresses helps to identify the source of the spam.

Keywords: Spam; semantics; stylistics; machine learning; clustering

1. Introduction

Spam emails are a major security concern. Not only do they serve as a means of earning illegal profit by selling pharmaceutical products, replica watches, sexual enhancers etc., but they also spread malware that either turns the receiver's computer into a bot or steals credentials from the user. Spam emails consume a huge amount of network bandwidth and computer memory and are one of the most important concerns of computer forensic experts. Various methods have been adopted to keep spam emails from entering inboxes, but they are just temporary solutions to a grave problem. The most effective resistance against spam would be to track the spammers, or the masterminds behind such spam campaigns, and prosecute them.
The criminals, or spammers, stay connected to servers called command and control (C&C) servers, so named because they are based on the command-and-control topology. The C&C servers are used to send spam emails carrying malware to a few computers; the malware, if executed, sets up the recipient's computer as a bot that can be used to send spam. These bots remain hooked to the C&C server, which is operated by the primary spammer or "bad guys" (as shown in Figure 1). The primary sender updates information in the C&C server from time to time, and this is reflected on the victim's computer. This leads to the conclusion that spam originating from the same criminal will have the same style of writing in the email, irrespective of whichever bots are used to propagate the spam.

Data mining techniques such as clustering and classification can be utilized to find similarities among these spam emails. A very important requirement for such approaches is choosing the correct set of features that can help in distinguishing the emails. In this work, we consider the text in the body and the style of the emails as the determining factors for detecting similarity and hence the source of the spam. Text from the body helps to determine the semantic attributes of the email, and the style of the emails is captured by stylistic features such as the use of punctuation, contractions etc. Once we have clustered the spam, the IP addresses of the web links contained in the emails of a particular cluster can be looked up to find information about the agency hosting the site, and legal procedures can be initiated. Thus this approach of clustering spam emails not only groups spam with similar content and style but actually brings emails from different campaigns under a single roof.

We divide this paper into four main sections: the second section gives a background study of different spam prevention and detection techniques, the third section introduces our procedure and the last section analyzes and compares results for the different clustering algorithms used in the two different phases of clustering.

Figure 1. Diagram representing the distribution of the spam network

2. Related Work

The use of machine learning and statistical techniques for spam classification is a traditional approach. The oldest and most conventional anti-spamming technique, the Bayesian filtering method [1], used the probability of recurring words to decide whether an email is spam or ham (legitimate). Other techniques for such content-based spam filtering include Support Vector Machines [2], Active Learning and Random Forests [3] and Neural Networks [4]. But spam filters only prevent spam from entering mailboxes; they take no action to track the sender of the email. Spammers devise various obfuscation techniques to bypass these filters and continue to flood mailboxes with unsolicited emails. Therefore researchers today are more interested in studying the commonalities between the spamming tendencies of different spammers. They believe that it is more effective to eliminate the source of spam by taking legal action against the spammer than to just filter those emails. This paper is motivated by the same idea, and we use semantic and stylistic procedures similar to those used in the fields of authorship attribution and genre classification.

Li et al. [5], in their study of spam over a period of time, found that spam emails generally arrive in bunches and are similar to one another either in their prototype or in the URLs contained in them. The main conclusion of their research was that the different spam campaigns around the world are unified under a small group of spammers. Chun et al. [6] studied similar tendencies of spammer behavior and concluded that clustering similar emails together based on the subject of the email and on similarities in IP address can be used as an effective way to track spammers.
They studied the spamming pattern of 350,394 of the total of 638,678 emails that they collected over a period of one month, from June to July. Their results were based on the emails that had subject similarity, and the top 5 daily clusters contained 83% of the total number of emails. Their top clusters had IP addresses that pointed to IP hosts in China and had as many as 3427 domain names registered under the same IP address.

Guerra & Pires [7] have been studying Brazilian spamming patterns for a few years, and they use a decision tree approach to find similarities in spam emails. Their Spam Miner places emails in the decision tree based on four main attributes: language, layout, web links in the email and the message type. The frequencies of each of these attributes are computed to place the email in the right node. The depth of the tree decides the rate of recurrence of a message.

A recent technique in this area is the method proposed by Zhang and Yang [8]. They consider the text in the email body and compute word similarity and sentence similarity. Word similarity is computed from the concept similarity of the words using HowNet (a knowledge base). Sentence similarity depends on the word similarity of the words that form the sentence. Similarity between semantic bodies depends on the concept vector space formed by the combination of words and sentences. Whenever a spam email is received, it is grouped according to the concept contained in the content of the message.

Semantic and stylistic spam clustering is similar to authorship attribution techniques used in web forums. The work in [9] uses syntactic, semantic and stylistic classification to distinguish posts written by the same author. The method gave an accuracy of about 90% and proved efficient for text of considerable length. Our method is similar to [9], but one challenge we face is that email contents are often too short; this makes the task more difficult and often lowers the accuracy that can be achieved compared to documents of considerable length.

3. Methodology

As mentioned earlier, we cluster emails based on their stylistic and semantic features. Our methodology can be divided into seven steps, as shown in Figure 2.

3.1. Data Collection

For this work we collected data from the UAB spam database. The database collects data from different recipients after they report a particular email to be spam. The database also contains data collected directly from mail servers that have catch-all email accounts. Whenever a mail server receives an email that has been sent to a non-existent email account, it is forwarded to a default account [10] called the catch-all account. We assume that any email reaching this account is spam. The initial dataset has approximately 10,000 spam emails containing spam of all categories.

Figure 2. Block diagram representing different stages of our methodology

3.2. Preprocessing

This stage primarily involves cleaning the data. We removed all emails that were in any language other than English. Emails that had only attachments or web links (URLs) in them were also removed. We kept all emails that had at least one line of text with more than 4 words. Preprocessing and data cleaning left us with approximately 25% of the original data collected. We divided this into four data sets of different sizes (the first three containing 200, 700 and 1300 emails) and performed clustering on them.

3.3. Feature Extraction

This is a very important step in any clustering process. Here we try to identify the features that can help in clustering similar documents together. In our case, we would like to cluster spam emails based on the style and content of the emails. Hence we divide our features into two main categories, stylistic and semantic, as described below.

3.3.1. Stylistic features. Stylistic features are based on the style in which the email is composed, as emails generated from the same botnet or written by the same spammer should have similarities in writing style. The main idea behind propagating a spam campaign is to persuade users to buy a product or to infect them with malware. For this purpose these emails invariably have a URL or email ID injected in the body, and this can be used as a stylistic feature of the email. Obfuscation or deliberate misspelling of words (for example, hello written as he110) is a common practice adopted by spammers to bypass filters [10]; hence the number of obfuscated or alphanumeric words is also a good feature. We consider a list of 57 stylistic features.
The features are:
1. The total word count of the text in the email
2. The number of new lines contained in the email
3. The total count of the punctuation used in the email body
4. The total number of contractions contained in the email
5. The total number of obfuscated words contained in the email
6. The total number of email IDs contained in the email
7. The total number of URLs contained in the body of the email
8. The count of the different punctuation marks used in the email. We prepared a list of 50 different punctuation marks and calculated the frequency of appearance of each of them in the email body.

3.3.2. Semantic features. By semantic features, we refer to the features that give us insight into the content or semantic meaning of the emails. We used two classes of semantic features: the Tf-Idf (term frequency-inverse document frequency) of the top x most frequent words in our dataset, and the count of the top x bigrams in the dataset, where x is decided by a cutoff on the minimum frequency count. In our case the value of x varied. Tf-Idf is a statistical measure that can be used to represent the importance of a term in a document [11]. Before computing Tf-Idf we first remove all stop words from the emails. Stop words are words, such as articles, which do not add any meaning to the semantics of a document. It is common practice to remove stop words from the text before processing. Once the stop words are removed, Tf-Idf can be calculated in three steps. Equation 1 calculates the term frequency (tf) of each term in a given document:

$$ tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (1) $$

where $tf_{i,j}$ is the term frequency of term i in document j, $n_{i,j}$ is the number of times term i occurs in document j, and $\sum_k n_{k,j}$ is the sum of the number of occurrences of all terms in document j; k ranges over the terms in document j. Equation 2 gives the general importance of a term in the whole corpus by dividing the total number of documents by the number of documents containing the term and taking the logarithm. Here idf stands for inverse document frequency, $|D|$ is the total number of documents in the data set and the denominator of Equation 2 is the number of documents in which term $t_i$ appears:

$$ idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \qquad (2) $$

Finally, Tf-Idf is the product of the results of Equations 1 and 2 and is given by Equation 3:

$$ (tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i \qquad (3) $$

A bigram is any two words that occur consecutively in a document. Knowing the most frequent bigrams used in a document helps us to better analyze its content. For our second class of semantic feature, we first prepared a corpus of the top x most frequently used bigrams from the whole data set and then built a feature vector that keeps the count of those bigram occurrences for each document. Here x is the cutoff value based on a predecided minimum frequency over all bigrams. Again, the value of x varied in our case.

3.3.3. Combined features. We take a combination of both the stylistic features and the semantic features. Once we have these features extracted for each document in our data set, we proceed to the clustering step.

3.4. Clustering

As mentioned in the previous step, we extracted stylistic and semantic features separately for the spam emails.
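To illustrate the semantic features, Equations 1 to 3 can be sketched in a few lines of Python. This is only a minimal sketch and not the implementation used in the experiments: the function name `tf_idf` and the toy documents are hypothetical, the natural logarithm is assumed because the base is not stated, and the documents are assumed to be tokenized with stop words already removed. Restricting the vocabulary to the top x words is left out.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf weight} dict per document.

    `documents` is a list of token lists (stop words already removed).
    """
    n_docs = len(documents)

    # Document frequency: number of documents in which each term appears
    # (the denominator of Equation 2).
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)       # n_{i,j} for every term i
        total = sum(counts.values())   # sum over k of n_{k,j}
        weights.append({
            # Equation 3: tf (Equation 1) times idf (Equation 2).
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights


# Hypothetical toy corpus of two already-tokenized emails.
docs = [
    ["cheap", "watches", "cheap"],
    ["cheap", "pills"],
]
scores = tf_idf(docs)
```

A term that appears in every document, such as "cheap" here, gets an idf of log(1) = 0 and therefore a Tf-Idf weight of zero, which is why the measure favors terms that distinguish one group of emails from another.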
We wanted to see the effect of each of these feature sets, separately as well as combined, on our clustering results. Consequently, we performed three different sets of clustering on our data. We also used two different kinds of clustering algorithms in our experiment: K-means and Expectation Maximization (EM). We used the Weka implementations of these algorithms [12]. We chose these two algorithms because we wanted to test different approaches to clustering: the partitional clustering approach (K-means) and the unsupervised method (EM), in which we do not provide the clustering algorithm with the number of clusters to divide the data into [13]. Therefore, we show three different clustering results for each dataset for each of the two algorithms.

3.5. Cluster Evaluation

Once we have the clusters, we evaluate them against the ground truth, using data that was manually collected. We calculate the purity of the overall clustering technique and present the results in the next section. Purity was calculated using Equation 4. We also analyze the clusters individually and present the result of the cluster that gave us the highest accuracy for each of the algorithms and each of the feature sets.

$$ \text{Purity}\,(\%) = \frac{\#\text{ of correctly clustered instances in each class}}{\text{total number of instances}} \times 100 \qquad (4) $$

While calculating the purity, we ignored clusters that were assigned very few instances in our experiments. We set a minimum threshold of 8%, i.e. at least 8% of the total emails must be contained in a cluster for it to be considered valid. We do this to avoid false alarms from singleton or small clusters claiming to be 100% pure.

3.6. Mapping Cluster to Domain Name / IP Address Retrieval

The IP addresses of the URLs embedded in the emails are fetched from the web links / domain names and stored in an IP table with the proper Message ID corresponding to each email in the clusters. The WHOIS information of the IP address can be looked up once the mapping has been done. WHOIS information retrieval is the process by which domain registration information is fetched. This can help identify web servers that host a large number of spam websites.
However, one issue with finding IP addresses is that they have to be resolved in real time, because spam campaigns are generally active for a limited amount of time, after which the domain becomes inactive or gets blacklisted. When done in real time, this information helps cybercrime investigators take appropriate action against the spammers.
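The last two steps of the methodology, cluster evaluation and IP mapping, can be sketched as follows. This is a hedged illustration and not the code used in the experiments. The function names are hypothetical. "Correctly clustered" is assumed to mean "belongs to the majority ground-truth class of its cluster", and the overall purity is assumed to be computed over the emails in the retained clusters only. The resolver is passed in as a parameter (defaulting to the standard library's `socket.gethostbyname`) so that lookups can be replaced by a stub. The WHOIS lookup itself is left out.

```python
import socket
from collections import Counter, defaultdict
from urllib.parse import urlparse


def cluster_purities(assignments, labels, min_fraction=0.08):
    """Return ({cluster_id: purity %}, overall purity %).

    assignments[i] is the cluster id of email i and labels[i] is its
    manually assigned ground-truth class. Clusters holding less than
    min_fraction of all emails are ignored (8% threshold of Section 3.5).
    """
    members = defaultdict(list)
    for cluster_id, label in zip(assignments, labels):
        members[cluster_id].append(label)

    total = len(assignments)
    per_cluster, correct, counted = {}, 0, 0
    for cluster_id, cluster_labels in members.items():
        if len(cluster_labels) < min_fraction * total:
            continue  # too small to be considered a valid cluster
        # Assumption: the majority class counts as "correctly clustered".
        majority = Counter(cluster_labels).most_common(1)[0][1]
        per_cluster[cluster_id] = 100.0 * majority / len(cluster_labels)
        correct += majority
        counted += len(cluster_labels)
    overall = 100.0 * correct / counted if counted else 0.0
    return per_cluster, overall


def map_clusters_to_ips(cluster_urls, resolve=socket.gethostbyname):
    """Map each cluster id to the set of IP addresses its URLs resolve to."""
    table = {}
    for cluster_id, urls in cluster_urls.items():
        ips = set()
        for url in urls:
            host = urlparse(url).hostname
            if host is None:
                continue  # not a parseable absolute URL
            try:
                ips.add(resolve(host))
            except OSError:
                pass  # domain no longer resolves (inactive or taken down)
        table[cluster_id] = ips
    return table
```

Silently skipping hosts that fail to resolve mirrors the real-time constraint described above: a domain that has already gone inactive simply contributes no IP address to its cluster.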

4. Results

In this section we present our results based on the three different types of clustering. Table 1 shows the clustering purity of the K-means algorithm on our dataset. We can see from the table that the purity of the combined feature set is better than that of both the stylistic and the semantic features individually. Similarly, Table 2 shows the results of the EM algorithm. We can also see from columns 3, 6 and 9 of Tables 1 and 2 that, although the overall purity of the algorithms on any dataset is not high, the cluster that performed best still had a high purity.

Stylistic clustering gave good results when the email length was short (i.e. where the total word count was low). Emails marked by distinguishing punctuation, like sentences always ending with an exclamation mark (!) or question mark (?), or by a distinctive style are easy to identify using these features. Semantic clusters give good results when the semantic body is rich in content. Hence we can say that the length of the emails also affects the type of distinguishing features. For example, in one of the data sets the stylistic clustering gave better results than the semantic clustering because many of the emails in that dataset were small in textual content and did not have enough text to be distinguished by the semantic clustering approach. However, the combined clustering balances the effect of both.

From the tables, we can see that the data set formed by adding further emails to the previous one produces better results than that previous data set. There are two reasons for this. Firstly, the additional emails that were added to the previous emails were mostly larger in their text content, and this made it easier to extract meaningful data from them for clustering. Secondly, most of the emails in that set originally belonged to one cluster and had a similar pattern in their style of writing. Because of these two reasons, we also saw a vast improvement in the overall accuracy of the K-means algorithm in clustering that data set.

When doing IP address mapping of the emails, we were able to get results only for the last set of 1300 emails, because most of the domains of the previous 1300-email data set were inactive by then.
These emails mapped to 26 unique IP addresses, and we could get the WHOIS information for these addresses. The WHOIS information showed that the addresses were predominantly from countries like the USA, China and Russia.

5. Conclusion

In this paper we present a stylistic and semantic way of clustering spam emails to identify spam of similar types. We experimented with a number of data sets and two different clustering algorithms and observed that K-means performed the best. We can use this methodology to identify the writing style of each spam campaign from each spammer and build a prototype from it that can be used for future identification of spam emails of similar types. Emails in the same cluster generally point to the same campaign. Therefore clusters with not very high purities can still point to the leading spam tendencies for the period.

In the future, we would like to experiment with various other features, such as email subjects, sender ID, URLs contained in the emails, number and type of attachments, etc., that can help us further improve our accuracy. We would like to perform a feature analysis for this task. It is very important to analyze which features give better output in this area, for the following reasons. Firstly, the amount of data (emails) is vast, and feature extraction and clustering can take a long time. It will be useful to know which features are more important and can help save computational time. Secondly, we deal with real-time data, which means that we need something that is fast and generalized. Spam emails keep changing, and the technique used to cluster them needs to be able to adapt to those changes. If we can come up with a set of features that gives good results and works on different kinds of emails, we could help computer forensics experts get to the main source of the spam.

Table 1.
Purity of clusters using the K-means algorithm on our data set. Purity of the cluster with the highest purity, by feature set (one row per data set):

Stylistic cluster    Semantic cluster    Combined cluster
100%                 100%                100%
100%                 100%                100%
83.2%                96.0%               98.8%
84.0%                76.7%               100%

Table 2. Purity of clusters using the EM algorithm on our data set. Purity of the cluster with the highest purity, by feature set (one row per data set):

Stylistic cluster    Semantic cluster    Combined cluster
56.1%                90.0%               96.1%
83.3%                63.0%               60.2%
98.8%                91.0%               96.6%
99.0%                89.0%               79.0%

References

[1] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian Approach to Filtering Junk E-mail, In Proc. of AAAI-98 Workshop on Learning for Text Categorization, USA.
[2] C. Liu, Experiments on Spam Detection with Boosting, SVM and Naïve Bayes, UCSC.
[3] D. Debarr, H. Wechsler, Spam Detection using Clustering, Random Forests and Active Learning, In Proc. of the 6th Conf. on Email and Anti-Spam, CEAS, USA.
[4] F. Tzeng, K. Ma, Opening the Black Box: Data Driven Visualization of Neural Networks, Visualization, VIS 05, IEEE.
[5] F. Li, M. Hsieh, An Empirical Study of Clustering Behavior of Spammers and Group Based Anti-Spam Strategies, In Proc. of the 3rd Conf. on Email and Anti-Spam, USA.
[6] C. Wei, A. P. Sprague, G. Warner, A. Skjellum, Clustering spam domains and targeting spam origin for forensic analysis, J. Digital Forensics, Security, and Law, Vol. 5.
[7] P. Guerra, D. Pires, D. Guedes, Spam Miner: A Platform for Detecting and Characterizing Spam Campaigns, In Proc. of the 6th Conf. on Email and Anti-Spam, CEAS, USA.
[8] Q. Zhang, H. Yang, Z. Yuan, J. Sun, Studies on the Semantic Body-Based Spam Filtering, In Proc. of the Intl. Conf. of Information Science and Management Engineering.
[9] S. Pillay, T. Solorio, Authorship Attribution of Web Forum Posts, In Proc. of the eCrime Researchers Summit, USA.
[10] C. Liu, S. Stamm, Fighting Unicode Obfuscated Spam, In Proc. of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, USA.
[11] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume 11, Issue 1.
[13] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (First Edition), Addison-Wesley Longman Publishing Co.
