Information Extraction from Spam Emails using Stylistic and Semantic Features to Identify Spammers

Soma Halder, University of Alabama at Birmingham, soma@cis.uab.edu
Richa Tiwari, University of Alabama at Birmingham, rtiwari@cis.uab.edu
Alan Sprague, University of Alabama at Birmingham, sprague@cis.uab.edu

Abstract

Traditional anti-spam methods filter spam emails and prevent them from entering the inbox, but take no measures to trace spammers and penalize them. This paper uses Natural Language Processing and Machine Learning techniques to cluster the spam emails from the same spammer based on the content and the style of the spam. Spam emails from different sources and over varied periods of time are studied with two sets of features: stylistic and semantic. Three sets of clustering are performed: clustering based on stylistic features, clustering based on semantic features, and clustering based on combined features. The clusters formed by different conventional clustering algorithms are compared and evaluated. Spam emails from the same source have similarities and cluster together. Spam emails contain URLs of the web pages that the spammer is trying to promote. Clusters are mapped to the Internet Protocol (IP) addresses of these URLs, and the WHOIS information of the IP addresses helps to identify the source of the spam.

Keywords: Spam; semantics; stylistics; machine learning; clustering

1. Introduction

Spam emails are a major security concern. Not only do they serve as a means of earning illegal profit by selling pharmaceutical products, replica watches, sexual enhancers, etc., but they also spread malware that either turns the receiver's computer into a bot or steals the user's credentials. Spam emails consume a huge amount of network bandwidth and computer memory and are one of the most important concerns of computer forensic experts. Various methods have been adopted to keep spam emails out of inboxes, but they are only temporary solutions to a grave problem.
The most effective resistance against spam would be to track the spammers, or the masterminds behind such spam campaigns, and prosecute them. The criminals or spammers stay connected to servers called command and control servers (C&C servers), so called because they are based on the command and control topology. The C&C servers are used to send spam emails carrying malware to a few computers; the malware, if executed, sets up the recipient's computer as a bot that can then be used to send spam. These bots remain connected to the C&C server, which is operated by the primary spammer or the "bad guys" (as shown in Figure 1). The primary sender updates information in the C&C server from time to time, and this is reflected on the victim's computer. This leads to the conclusion that spam originating from the same criminal will have the same style of writing, irrespective of which bots are used to propagate it.

Data mining techniques like clustering and classification can be used to find similarities among these spam emails. A very important requirement for such approaches is choosing the correct set of features that can help in distinguishing the emails. In this work, we consider the text in the body and the style of the emails as the determining factors for detecting the similarity, and hence the source, of the spam. Text from the email body helps to determine the semantic attributes of the email, and the style of the email is captured by stylistic features such as the use of punctuation, contractions, etc. Once we have the clustered spam, the IP addresses of the web links present in the emails of a particular cluster can be verified to track the information of the agency hosting the site, and legal procedures can be taken. Thus this approach of clustering spam emails not only groups spam with similar content and style but actually clusters emails from different campaigns under a single roof.
We divide this paper into four main sections: the second section gives a background study of the different spam prevention and detection techniques, the third section introduces our procedure, and the last section analyzes and compares results for the different clustering algorithms used in the two different phases of clustering.

Figure 1. Diagram representing the distribution of the Spam Network

2. Related Work

The use of machine learning and statistical techniques for spam classification is a traditional approach. The oldest and most conventional anti-spam technique, the Bayesian filtering method [1], used the probability of recurring words to decide whether an email is spam or ham (legitimate). Other content-based spam filtering techniques include the use of Support Vector Machines [2], Active Learning and Random Forests [3], and Neural Networks [4]. But spam filters only prevent spam from entering mailboxes; they take no action to track the sender of the email. Spammers devise various obfuscation techniques to bypass these filters and continue to flood mailboxes with unsolicited emails. Therefore researchers today are more interested in studying the commonalities between the different spamming tendencies of spammers. They believe that it is more effective to eliminate the source of spam by taking legal action against the spammer than to just filter those emails. This paper is motivated by the same idea, and we use semantic and stylistic procedures similar to those used in the fields of authorship attribution and genre classification.

Li et al. [5], in their study of spam over a period of time, found that spam emails generally arrive in bunches and are similar to one another either in their prototype or in the URLs present in them. The main conclusion of their research was that the different spam campaigns around the world are unified under a small group of spammers. Chun et al. [6] studied similar tendencies of spammer behavior and concluded that clustering similar emails together, based on the subject of the email and the similarities in IP address, can be used as an effective way to track spammers.
They studied the spamming pattern of 350,394 of the total of 638,678 emails that they collected over a period of one month, June-July 2009. Their results were based on the emails that had subject similarity, and the top 5 daily clusters accounted for 83% of the total number of emails. Their top clusters had IP addresses that pointed to hosts in China and had as many as 3427 domain names registered under the same IP address.

Guerra & Pires [7] have been studying Brazilian spamming patterns for a few years, and they use a decision tree approach to find similarities in spam emails. Their spam miner places emails in the decision tree based on four main attributes: language, layout, web link in the email, and message type. The frequencies of each of these attributes are computed to place the email in the right node. The depth of the tree decides the rate of recurrence of the message.

A recent technique used in this area is the method proposed by Zhang and Yang [8]. They consider the text in the email body and compute word similarity and sentence similarity. Word similarity is computed based on the concept similarity of the words using HowNet (a knowledge base). Sentence similarity depends on the word similarity of the words that form the sentence. Similarity between the semantic bodies depends on the concept vector space formed by the combination of words and sentences. Whenever a spam email is received, it is grouped according to the concept present in the content of the message.

Semantic and stylistic spam clustering has similarities with authorship attribution techniques used in web forums. The work in [9] uses syntactic, semantic and stylistic classification to distinguish posts written by the same author. The method gave an accuracy of about 90% and proved efficient for text of considerable length.
Our method is similar to [9], but one challenge we face is that email contents are often too short, which makes the task more difficult and often decreases the accuracy that can be achieved compared to documents of considerable length.

3. Methodology

As mentioned earlier, we consider clustering emails based on their stylistic and semantic features. Our methodology can be divided into seven steps, as shown in Figure 2.

3.1. Data Collection

For this work we collected the data from the UAB spam database. The database collects data from different recipients after they report a particular email to be spam. The database also contains data collected directly from mail servers that have catch-all email accounts. Whenever a mail server receives an email that has been sent to a non-existent email account, it is forwarded to a default account [10] called the catch-all account. We assume that any email reaching this account is spam. The initial dataset has approximately 10,000 spam emails covering spam of all categories.

Figure 2. Block diagram representing different stages of our methodology

3.2. Preprocessing

This stage primarily involves cleaning the data. Here we removed all emails that were in any language other than English. Emails that had only attachments or web links (URLs) in them were also removed. We kept all emails that had at least one line of text with more than 4 words. Preprocessing and data cleaning left us with approximately 2600 emails, i.e. roughly 25% of the original data collected. We divided this into four data sets of different sizes (200, 700, 1300 and 2600 emails) and performed clustering on them.

3.3. Feature Extraction

This is a very important step in any clustering process. Here we try to identify the features that can help in clustering similar documents together. In our case, we would like to cluster spam emails based on the style and content of the emails. Hence we divide our features into two main categories, the stylistic and the semantic features, as described below.

3.3.1. Stylistic features. Stylistic features are based on the style in which the email is composed, as emails generated from the same botnet or written by the same spammer should have similarities in writing style. The main idea behind propagating a spam campaign is to persuade users to buy a product or to infect them with malware. For this purpose these emails invariably have a URL or email id injected in the body, and this can be used as a stylistic feature of the email.
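As an illustration of how the stylistic counts used in this section might be computed, the following Python sketch extracts a few of them from an email body. It is only a sketch under stated assumptions: the regular expressions, the treatment of any word mixing letters and digits as "obfuscated", the function name `stylistic_features`, and the use of Python's 32-character `string.punctuation` (rather than the 50 punctuation marks used in the paper) are all hypothetical choices, not the paper's implementation.

```python
import re
import string

# All patterns below are illustrative assumptions, not the paper's rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CONTRACTION_RE = re.compile(r"\b\w+'(?:t|s|re|ve|ll|d|m)\b", re.IGNORECASE)
# Assumption: a word containing both letters and digits (e.g. "he110")
# is counted as obfuscated.
OBFUSCATED_RE = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def stylistic_features(body):
    """Return a dict of simple stylistic counts for one email body."""
    # Strip URLs and email ids before word-level pattern counts so that
    # their characters are not mistaken for contractions or obfuscation.
    text = EMAIL_RE.sub(" ", URL_RE.sub(" ", body))
    features = {
        "word_count": len(body.split()),
        "newline_count": body.count("\n"),
        "punctuation_count": sum(ch in string.punctuation for ch in body),
        "contraction_count": len(CONTRACTION_RE.findall(text)),
        "obfuscated_count": len(OBFUSCATED_RE.findall(text)),
        "email_id_count": len(EMAIL_RE.findall(body)),
        "url_count": len(URL_RE.findall(body)),
    }
    # One frequency feature per punctuation mark.
    for ch in string.punctuation:
        features["punct_" + ch] = body.count(ch)
    return features


sample = ("Don't miss it!\n"
          "Buy he110 pills at http://shop.example.com now, "
          "or mail sales@example.com!")
feats = stylistic_features(sample)
```

Each email then becomes one fixed-length numeric vector, which is the form a clustering algorithm expects.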
Obfuscation or deliberate misspelling of words (for example, hello written as he110) is a common practice adopted by spammers to bypass filters [10]; hence the number of obfuscated or alphanumeric words is also a good feature. We consider a list of 57 stylistic features. The features are:

1. The total word count of the text in the email
2. The number of new lines present in the email
3. The total count of the punctuation used in the email body
4. The total number of contractions present in the email
5. The total number of obfuscated words present in the email
6. The total number of email ids present in the email
7. The total number of URLs present in the body of the email
8. The count of different punctuations used in the email. We prepared a list of 50 different punctuations and calculated the frequency of appearance of each of them in the email body.

3.3.2. Semantic features. By semantic features, we refer to the features that give us insight into the content or semantic meaning of the emails. We have used two classes of semantic features: the Tf-Idf (term frequency-inverse document frequency) of the top x most frequent words used in our dataset, and the count of the top x bigrams used in the dataset, where x is decided based upon the cutoff of the minimum frequency count. In our case x usually varied from 30 to 60.

Tf-Idf is a statistical measure that can be used to represent the importance of a term in a document [11]. Before computing Tf-Idf we first remove all the stop words from the emails. Stop words are words, such as articles, which do not add any meaning to the semantics of a document. It is a common practice to remove stop words from the text before processing. Once the stop words are removed, Tf-Idf can be calculated in three steps. Equation 1 calculates the term frequency (tf) of each term in a given document.

\[ tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{1} \]

where $tf_{i,j}$ is the term frequency of term $i$ in document $j$, $n_{i,j}$ is the number of times term $i$ occurred in document $j$, and $\sum_{k} n_{k,j}$ is the sum of the number of occurrences of all the terms in document $j$; $k$ is any term in document $j$.

Equation 2 gives the general importance of a term in the whole corpus by dividing the total number of documents by the number of documents containing the term. Here idf stands for inverse document frequency, $D$ is the total number of documents in the data set, and the denominator of Equation 2 is the number of documents in which the term $t_i$ appears.

\[ idf_i = \log \frac{D}{|\{d : t_i \in d\}|} \tag{2} \]

Finally, Tf-Idf is the product of the results obtained in Equations 1 and 2 and is given by Equation 3.

\[ (tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i \tag{3} \]

A bigram is any two words that occur consecutively in a document. Knowing the most frequent bigrams used in a document helps us to do a better analysis of its content. For our second class of semantic features, we first prepared a corpus of the top x most frequently used bigrams from the whole data set and then built a feature vector that keeps the count of those bigram occurrences for each of the documents. x is the cutoff value that is based on a pre-decided minimum frequency of the total bigrams. Again, x varies from 30 to 60 in our case.

3.3.3. Combined features. We take a combination of both the stylistic features and the semantic features. Once we have these features extracted for each document in our data set, we proceed to the clustering step.

3.4. Clustering

As mentioned in the previous step, we extracted stylistic and semantic features separately for the spam emails. We wanted to see the effect of each of these feature sets, separately as well as combined, on our clustering results. Consequently, we performed three different sets of clustering on our data. We also used two different kinds of clustering algorithms in our experiment: K-means and Expectation Maximization (EM). In this work we used the Weka implementation of both algorithms [12]. We used these two algorithms because we wanted to test different approaches to clustering: the partitional clustering approach (K-means) and the unsupervised method (EM), in which we do not provide the clustering algorithm with the number of clusters to cluster the data into [13]. Therefore, we show three different clustering results for each dataset for the two different algorithms.

3.5. Cluster Evaluation

Once we have the clusters, we evaluate them based on the ground truth with the data that was manually collected. We calculate the purity of the overall clustering technique and present the results in the next section. Purity was calculated using Equation 4. We also analyze the clusters individually and present the result of the cluster that gave us the highest accuracy for each of the algorithms and each of the feature sets.

\[ \text{Purity (\%)} = \frac{\text{\# of correctly clustered instances in each class}}{\text{total number of instances}} \tag{4} \]

While calculating the purity, we ignored those clusters that were assigned very few instances in our experiments. We used a minimum threshold of 8%, i.e. at least 8% of the total emails should be present in a cluster for us to consider it valid. We do this because we want to avoid false alarms given by singleton or small clusters claiming they are 100% pure.

3.6. Mapping Cluster to Domain Name / IP address retrieval

IP addresses of the URLs embedded in the emails are fetched from the web links/domain names and stored in the IP table with the proper Message ID corresponding to each email in the clusters. The WHOIS information of an IP address can be looked up once the mapping has been done.
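The URL-to-IP mapping described above could be sketched in Python roughly as follows. This is a minimal illustration, not the paper's implementation: the function names `extract_hosts` and `build_ip_table`, the URL pattern, and the dictionary layout of the IP table are assumptions. DNS resolution uses the standard `socket.gethostbyname` call by default, and the resolver is passed in as a parameter so it can be replaced by a stub when no network is available.

```python
import re
import socket
from urllib.parse import urlparse

# Assumed URL pattern; real spam may need more tolerant matching.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_hosts(body):
    """Return the distinct host names of the URLs found in an email body."""
    hosts = []
    for url in URL_RE.findall(body):
        try:
            host = urlparse(url.rstrip(".,;:!?)")).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def build_ip_table(emails, resolve=socket.gethostbyname):
    """Map each message ID to the IP addresses of the hosts it links to.

    `emails` is a dict {message_id: body}. Hosts that no longer resolve
    (for example, inactive spam domains) are skipped.
    """
    table = {}
    for msg_id, body in emails.items():
        ips = []
        for host in extract_hosts(body):
            try:
                ip = resolve(host)
            except OSError:
                continue
            if ip not in ips:
                ips.append(ip)
        table[msg_id] = ips
    return table


# Offline usage example with a stub resolver and documentation-only names.
_known = {"shop.example.com": "192.0.2.10", "pills.example.net": "192.0.2.10"}


def _stub_resolve(host):
    if host in _known:
        return _known[host]
    raise OSError("host not found")


ip_table = build_ip_table(
    {
        "m1": "Buy now http://shop.example.com/buy?id=1 today",
        "m2": "Visit http://pills.example.net/x and http://dead.example.org/.",
        "m3": "no links here",
    },
    resolve=_stub_resolve,
)
```

In the stub data, two different domains resolve to one address, which is the kind of shared hosting that the cluster-to-IP mapping is meant to expose.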
WHOIS information retrieval is the process by which domain registration information is fetched. This can help identify web servers that host a large number of spam websites. However, one issue with finding IP addresses is that they have to be found in real time, because spam campaigns are generally active for a limited amount of time, after which the domain becomes inactive or gets blacklisted. When done in real time, this information helps cybercrime investigators take appropriate action against the spammers.
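Returning to the semantic features of Section 3.3.2, the following Python sketch shows one way Equations 1 to 3 and the bigram counts could be computed. It assumes the emails are already tokenized with stop words removed, and it uses the natural logarithm, since the base of the logarithm is not specified in the text. The function names `tf_idf` and `bigram_counts` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute Tf-Idf weights following Equations 1-3.

    `docs` is a list of token lists (stop words already removed).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Denominator of Equation 2: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = sum(counts.values())  # denominator of Equation 1
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights


def bigram_counts(tokens):
    """Count pairs of consecutive words in one tokenized document."""
    return Counter(zip(tokens, tokens[1:]))


docs = [["cheap", "pills", "cheap"], ["cheap", "watches"]]
weights = tf_idf(docs)
bigrams = bigram_counts(["buy", "cheap", "pills", "buy", "cheap"])
```

In this toy corpus "cheap" appears in every document, so its idf, and therefore its Tf-Idf weight, is zero, while "pills" and "watches" receive positive weights.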

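The cluster evaluation of Section 3.5 can also be sketched in a few lines of Python. The sketch assumes that a "correctly clustered" instance is one carrying the majority ground-truth label of its cluster, and that the overall purity is taken over the emails in the clusters that pass the 8% size threshold; both readings are assumptions, since the text does not spell them out. The function name `cluster_purities` is hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purities(assignments, labels, min_fraction=0.08):
    """Return (overall_purity, {cluster_id: purity}).

    `assignments[i]` is the cluster of email i and `labels[i]` its
    ground-truth label. Clusters holding fewer than `min_fraction` of
    all emails are ignored, mirroring the 8% threshold in the text.
    """
    members = defaultdict(list)
    for cluster_id, label in zip(assignments, labels):
        members[cluster_id].append(label)

    total = len(labels)
    per_cluster = {}
    correct = 0
    kept = 0
    for cluster_id, cluster_labels in members.items():
        if len(cluster_labels) < min_fraction * total:
            continue  # too small to be considered a valid cluster
        # Assumption: instances with the majority label count as correct.
        majority = Counter(cluster_labels).most_common(1)[0][1]
        per_cluster[cluster_id] = majority / len(cluster_labels)
        correct += majority
        kept += len(cluster_labels)

    overall = correct / kept if kept else 0.0
    return overall, per_cluster


# 20 emails: cluster 0 is mixed, cluster 1 is pure, cluster 2 is a singleton.
assignments = [0] * 10 + [1] * 9 + [2]
labels = ["A"] * 8 + ["B"] * 2 + ["B"] * 9 + ["A"]
overall, per_cluster = cluster_purities(assignments, labels)
```

The singleton cluster would be 100% pure, but it falls below the threshold and is excluded, which is exactly the false alarm the threshold is meant to avoid.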
4. Results

In this section we present our results based on the three different types of clustering. Table 1 shows the clustering accuracy of the K-means algorithm on our dataset. We can see from the table that, on the largest data set (2600 emails), the purity of the combined feature set is better than that of both the stylistic and the semantic features individually. Similarly, Table 2 shows the results of the EM algorithm. We can also see from columns 3, 6 and 9 of Tables 1 and 2 that, although the overall purity of the algorithms on any dataset is not high, the cluster that performed the best still had a high purity.

Stylistic clustering gave good results when the email length was short (i.e. where the total count of words was low). Emails marked by distinguishing punctuation, like a sentence always ending with an exclamation mark (!) or question mark (?), or by a distinctive style are easy to identify using these features. Semantic clusters give good results when the semantic body is rich in content. Hence we can say that the length of the emails also affects the type of distinguishing features. For example, in the 1300-email dataset, stylistic clustering gave better results than semantic clustering, because many of the emails in that dataset were smaller in textual content and did not contain enough text to be distinguished by the semantic clustering approach. However, the combined clustering balances the effect of both.

From the tables, we can see that the data set of 2600 emails produces better results than the 1300-email data set. This is because of the following two reasons. Firstly, the additional 1300 emails that were added to the previous 1300 emails were mostly larger in their text content, and this made it easier to extract meaningful data from them for clustering. Secondly, most of the emails in that set originally belonged to one cluster and had a similar pattern in their style of writing. Because of these two reasons, we also saw a vast improvement in the overall accuracy of the K-means algorithm in clustering that data set.
On doing IP address mapping of the clusters, we were successful in getting results for only the last 1300 emails, because most of the domains of the previous 1300-email data set were inactive by then. These 1300 emails mapped to 26 unique IP addresses, and we could get the WHOIS information for these addresses. The WHOIS information showed that the addresses were predominantly from countries like the USA, China and Russia.

5. Conclusion

In this paper we present a stylistic and semantic way of clustering spam emails to identify spam of similar types. We experimented with data sets of different sizes and two different clustering algorithms and observed that K-means performed the best. We can use this methodology to identify the writing style of each spam campaign from each spammer and build a prototype from it that can be used for future identification of spam emails of similar types. Emails in the same cluster generally point to the same campaign. Therefore clusters with not very high purities can point to the leading spam tendencies for the period.

In the future, we would like to experiment with various other features, such as email subjects, sender ID, URLs present in the emails, and number and type of attachments, that can help us further improve our accuracy. We would also like to perform a feature analysis for this task. It is very important to analyze which features give better output in this area, for the following reasons. Firstly, the amount of data or emails is vast, and feature extraction and clustering can take a long time. It will be useful to know which features are more important and can help save computational time. Secondly, we deal with real-time data, which means that we need something that is fast and generalized. Spam emails keep changing, and the technique used to cluster them needs to be able to adapt to those changes.
If we can come up with a set of features that give good results and can work on different kinds of emails, we could help computer forensics experts to get to the main.

Table 1. Purity of clusters using the K-means algorithm on our data set. For each feature set, the columns give the overall purity obtained, the purity of the cluster with the highest purity, and the number of emails in that cluster.

            Stylistic Cluster          Semantic Cluster           Combined Cluster
Data set    Overall  Highest  # of     Overall  Highest  # of     Overall  Highest  # of
            purity   purity   emails   purity   purity   emails   purity   purity   emails
200         69.7%    100%     28       85.3%    100%     28       72.2%    100%     28
700         69.3%    100%     294      63.7%    100%     56       64.1%    100%     70
1300        61.1%    83.2%    234      40.0%    96.0%    182      57.4%    98.8%    182
2600        64.3%    84.0%    1333     70.3%    76.7%    1211     80.0%    100%     1577

Table 2. Purity of clusters using the EM algorithm on our data set. Columns are as in Table 1.

            Stylistic Cluster          Semantic Cluster           Combined Cluster
Data set    Overall  Highest  # of     Overall  Highest  # of     Overall  Highest  # of
            purity   purity   emails   purity   purity   emails   purity   purity   emails
200         50.2%    56.1%    122      84.3%    90.0%    92       75.6%    96.1%    104
700         61%      83.3%    378      63.0%    63.0%    700      58.2%    60.2%    609
1300        31.5%    98.8%    143      29.8%    91.0%    182      33.3%    96.6%    182
2600        51.8%    99.0%    1576     84.6%    89.0%    1762     57.4%    79.0%    1995

6. References

[1] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian Approach to Filtering Junk Email, In Proc. of AAAI-98 Workshop on Learning for Text Categorization, USA, 1998.
[2] C. Liu, Experiments on Spam Detection with Boosting, SVM and Naïve Bayes, UCSC.
[3] D. Debarr, H. Wechsler, Spam Detection using Clustering, Random Forests and Active Learning, In Proc. of the 6th Conf. on Email and Anti-Spam, CEAS, USA, 2009.
[4] F. Tzeng, K. Ma, Opening the Black Box: Data Driven Visualization of Neural Networks, Visualization, 2005. VIS 05. IEEE, pp. 383-390.
[5] F. Li, M. Hsieh, An Empirical Study of Clustering Behavior of Spammers and Group Based Anti-Spam Strategies, In Proc. of the 3rd Conf. on Email and Anti-Spam, USA, 2006.
[6] C. Wei, A.P. Sprague, G. Warner, A. Skjellum, Clustering Spam Domains and Targeting Spam Origin for Forensic Analysis, J. Digital Forensics, Security, and Law, Vol. 5, 2010.
[7] P. Guerra, D. Pires, D. Guedes, Spam Miner: A Platform for Detecting and Characterizing Spam Campaigns, In Proc. of the 6th Conf. on Email and Anti-Spam, CEAS, USA, 2008.
[8] Q. Zhang, H. Yang, Z. Yuan, J. Sun, Studies on the Semantic Body-Based Spam Filtering, In Proc. of the Intl. Conf. of Information Science and Management Engineering, 2010.
[9] S. Pillay, T. Solorio, Authorship Attribution of Web Forum Posts, In Proc. of the eCrime Researchers Summit, USA, 2010.
[10] C. Liu, S. Stamm, Fighting Unicode Obfuscated Spam, In Proc. of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, USA, 2007.
[11] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume 11, Issue 1, 2009.
[13] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (First Edition), Addison-Wesley Longman Publishing Co., 2005.