Information Extraction from Spam Emails using Stylistic and Semantic Features to Identify Spammers
Soma Halder, University of Alabama at Birmingham
Richa Tiwari, University of Alabama at Birmingham
Alan Sprague, University of Alabama at Birmingham

Abstract

Traditional anti-spamming methods filter spam emails and prevent them from entering the inbox, but take no measures to trace spammers and penalize them. This paper uses Natural Language Processing and Machine Learning techniques to cluster the spam emails from the same spammer based on the content and the style of the spam. Spam emails from different sources and over a varied period of time are studied with two sets of features: stylistic and semantic. Three sets of clustering are performed: clustering based on stylistic features, clustering based on semantic features and clustering based on combined features. The clusters formed by different conventional clustering algorithms are compared and evaluated. Spam emails from the same source have similarities and cluster together. Spam emails contain URLs of the web pages that the spammer is trying to promote. Clusters are mapped to the Internet Protocol (IP) addresses of these URLs, and the WHOIS information of the IP addresses helps to identify the source of the spam.

Keywords: spam; semantics; stylistics; machine learning; clustering

1. Introduction

Spam emails are a major security concern. Not only do they serve as a means of earning illegal profit by selling pharmaceutical products, replica watches, sexual enhancers, etc., but they also spread malware that either turns the receiver's computer into a bot or steals the user's credentials. Spam emails consume a huge amount of network bandwidth and computer memory and are one of the most important concerns of computer forensic experts. Various methods have been adopted to keep spam emails out of inboxes, but they are only temporary solutions to a grave problem. The most effective resistance against spam would be to track the spammers, or the masterminds behind such spam campaigns, and prosecute them.
The criminals, or spammers, stay connected to servers called command and control (C&C) servers. These servers are so called because they are based on a command-and-control topology. The C&C servers are used to send spam emails carrying malware to a few computers; the malware, if executed, sets up the recipient's computer as a bot that can be used to send spam. These bots remain hooked to the C&C server, which is operated by the primary spammer or "bad guys" (as shown in Figure 1). The primary sender updates information in the C&C server from time to time, and this is reflected on the victim's computer. This suggests that spam originating from the same criminal will have the same style of writing in the email, irrespective of whatever bots are used to propagate it.

Data mining techniques such as clustering and classification can be utilized to find similarities among these spam emails. A very important requirement for such approaches is choosing the correct set of features to distinguish the emails. In this work, we consider the text in the body and the style of the emails as the determining factors for detecting similarity and hence the source of the spam. Text from the body helps to determine the semantic attributes of the email, and the style of the emails is captured by stylistic features such as the use of punctuation, contractions, etc. Once we have the clustered spam, the IP addresses of the web links in the emails of a particular cluster can be checked to obtain information about the agency hosting the site, and legal procedures can be taken. Thus this approach of clustering spam emails not only groups spam with similar content and style but also brings emails from different campaigns under a single roof.

We divide this paper into four main sections: the second section is a background study of different spam prevention and detection techniques, the third section introduces our procedure, and the last section analyzes and compares results for the different clustering algorithms used in the two different phases of clustering.

Figure 1. Diagram representing the distribution of the Spam Network

2. Related Work

The use of machine learning and statistical techniques for spam classification is a traditional approach. The oldest and most conventional anti-spamming technique, the Bayesian filtering method [1], uses the probability of recurring words to decide whether an email is spam or ham (legitimate). Other content-based spam filtering techniques include Support Vector Machines [2], Active Learning and Random Forests [3], and Neural Networks [4]. But spam filters only prevent spam from entering mailboxes; they take no action to track the sender of the email. Spammers devise various obfuscation techniques to bypass these filters and continue to flood mailboxes with unsolicited emails. Therefore researchers today are more interested in studying the commonalities between the spamming tendencies of different spammers. They believe that it is more effective to eliminate the source of spam by taking legal action against the spammer than to merely filter those emails. This paper is motivated by the same idea, and we use semantic and stylistic procedures similar to those used in the fields of authorship attribution and genre classification.

Li et al. [5], in their study of spam over a period of time, found that spam emails generally arrive in bunches and are similar to one another either in their prototype or in the URLs in them. The main conclusion of their research was that the different spam campaigns around the world are unified under a small group of spammers. Wei et al. [6] studied similar tendencies of spammer behavior and concluded that clustering similar emails together based on the subject of the email and similarities in IP address can be used as an effective way to track spammers.
They studied the spamming pattern of 350,394 of the total of 638,678 emails that they collected over a period of one month, from June-July [...]. Their results were based on the emails that had subject similarity, and the top 5 daily clusters accounted for 83% of the total number of emails. Their top clusters had IP addresses that pointed to hosts in China and had as many as 3427 domain names registered under the same IP address.

Guerra & Pires [7] have been studying Brazilian spamming patterns for a few years, and they use a decision tree approach to find similarities in spam emails. Their Spam Miner places emails in the decision tree based on four main attributes: language, layout, web links in the email and the message type. The frequencies of each of these attributes are computed to place the email in the right node. The depth of the tree decides the rate of recurrence of the message.

A recent technique in this area is the method proposed by Zhang and Yang [8]. They consider the text in the email body and compute word similarity and sentence similarity. Word similarity is computed based on the concept similarity of the words using HowNet (a knowledge base). Sentence similarity depends on the word similarity of the words that form the sentence. Similarity between semantic bodies depends on the concept vector space formed by the combination of words and sentences. Whenever a spam email is received, it is grouped according to the concept of the content of the message.

Semantic and stylistic spam clustering has similarities with authorship attribution techniques used in web forums. The work in [9] uses syntactic, semantic and stylistic classification to distinguish posts written by the same author. The method gave an accuracy of about 90% and proved efficient for text of considerable length. Our method is similar to [9], but one challenge we face is that email contents are often too short, which makes the task more difficult and often decreases the accuracy that can be achieved compared to documents of considerable length.

3. Methodology

As mentioned earlier, we consider clustering emails based on their stylistic and semantic features. Our methodology can be divided into seven steps, as shown in Figure 2.

3.1. Data Collection

For this work we collected data from the UAB spam database. The database collects data from different recipients after they report a particular email to be spam. The database also contains data collected directly from mail servers that have catch-all accounts. Whenever a mail server receives an email that has been sent to a non-existent email account, it is forwarded to a default account [10] called the catch-all account. We assume that any emails reaching this account are spam. The initial dataset has approximately 10,000 spam emails containing spam of all categories.

Figure 2. Block diagram representing different stages of our methodology

3.2. Preprocessing

This stage primarily involves cleaning the data. We removed all emails that were in any language other than English. Emails that had only attachments or web links (URLs) in them were also removed. We kept all emails that had at least one line of text with more than 4 words. Preprocessing and data cleaning left us with approximately [...] emails, i.e. 25% of the original data collected. We divided this into four data sets of different sizes (200, 700, 1300 and [...] emails in each set) and performed clustering on them.

3.3. Feature Extraction

This is a very important step in any clustering process. Here we try to identify the features that can help in clustering similar documents together. In our case, we would like to cluster spam emails based on the style and content of the emails. Hence we divide our features into two main categories, stylistic and semantic, as described below.

3.3.1. Stylistic features. Stylistic features are based on the style in which the email is composed, as emails generated from the same botnet or written by the same spammer should have similarities in writing style. The main idea behind propagating a spam campaign is to persuade users to buy a product or to infect them with malware. For this purpose these emails invariably have a URL or email id injected in the body, and this can be used as a stylistic feature of the email. Obfuscation, or deliberate misspelling of words (for example, hello written as he110), is a common practice adopted by spammers to bypass filters [10]; hence the number of obfuscated or alphanumeric words is also a good feature. We consider a list of 57 stylistic features.
The features are:
1. The total word count of the text in the email
2. The number of new lines in the email
3. The total count of punctuation marks used in the body
4. The total number of contractions in the email
5. The total number of obfuscated words in the email
6. The total number of email ids in the email
7. The total number of URLs in the body of the email
8. The count of different punctuation marks used in the email. We prepared a list of 50 different punctuation marks and calculated the frequency of appearance of each of them in the body.

3.3.2. Semantic features. By semantic features, we refer to the features that give us insight into the content or semantic meaning of the emails. We used two classes of semantic features: the Tf-Idf (term frequency-inverse document frequency) of the top x most frequent words in our dataset, and the count of the top x bigrams in the dataset, where x is decided by a cutoff on the minimum frequency count. In our case x usually varied from [...].

Tf-Idf is a statistical measure that can be used to represent the importance of a term in a document [11]. Before computing Tf-Idf we first remove all the stop words from the emails. Stop words are words, such as articles, which do not add any meaning to the semantics of a document. It is common practice to remove stop words from the text before processing. Once the stop words are removed, Tf-Idf can be calculated in three steps. Equation 1 calculates the term frequency (tf) of each term in a given document:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{1}$$

where $tf_{i,j}$ is the term frequency of term $i$ in document $j$, $n_{i,j}$ is the number of times term $i$ occurs in document $j$, and $\sum_{k} n_{k,j}$ is the total number of occurrences of all terms in document $j$; $k$ ranges over the terms in document $j$.

Equation 2 gives the general importance of a term in the whole corpus by dividing the total number of documents by the number of documents containing the term and taking the logarithm of the quotient. idf stands for inverse document frequency, $|D|$ is the total number of documents in the data set, and the denominator of Equation 2 is the number of documents in which the term $t_i$ appears:

$$idf_{i} = \log \frac{|D|}{|\{d : t_i \in d\}|} \tag{2}$$

Finally, Tf-Idf is the product of the results obtained in Equations 1 and 2 and is given by Equation 3:

$$(tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_{i} \tag{3}$$

A bigram is any two words that occur consecutively in a document. Knowing the most frequent bigrams in a document helps us to analyze its content better. For our second class of semantic features, we first prepared a corpus of the top x most frequently used bigrams from the whole data set and then built a feature vector that keeps the count of those bigram occurrences for each document. x is the cutoff value based on a pre-decided minimum frequency of the total bigrams. Again, x varies from [...] in our case.

3.3.3. Combined features. We take a combination of both the stylistic and the semantic features. Once we have these features extracted for each document in our data set, we proceed to the clustering step.

3.4. Clustering

As mentioned in the previous step, we extracted stylistic and semantic features separately for the spam emails.
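The semantic feature extraction recapped here can be illustrated with a minimal Python sketch of Equations 1 to 3 and of the bigram count vectors. It is not the authors' implementation: the tokenization rule, the tiny stop-word list, the helper names (`words`, `tf_idf`, `bigram_features`) and the `min_freq` cutoff value are all assumptions made for illustration, and bigrams are counted on the raw token sequence because the text does not say whether stop words are removed first.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is"}


def words(text):
    """Lowercase word tokens; this tokenization rule is an assumption."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document, following Equations 1 to 3."""
    tokenized = [[w for w in words(d) if w not in STOP_WORDS] for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # a document counts once per term
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(counts.values())  # sum over k of n_{k,j}
        weights.append({
            term: (n / total) * math.log(len(docs) / doc_freq[term])
            for term, n in counts.items()
        })
    return weights


def bigram_features(docs, min_freq=2):
    """Return (bigram vocabulary, per-document count vectors).

    The vocabulary keeps bigrams seen at least min_freq times in the corpus;
    min_freq stands in for the paper's pre-decided minimum frequency.
    """
    per_doc = []
    for d in docs:
        tokens = words(d)
        per_doc.append(Counter(zip(tokens, tokens[1:])))
    corpus = Counter()
    for counts in per_doc:
        corpus.update(counts)
    vocab = sorted(b for b, n in corpus.items() if n >= min_freq)
    return vocab, [[counts[b] for b in vocab] for counts in per_doc]


# Hypothetical three-email corpus, for illustration only.
emails = ["buy cheap watches now", "cheap watches online", "win a prize now"]
weights = tf_idf(emails)
vocab, vectors = bigram_features(emails)
```

With this toy corpus, a term that appears in every email would receive a weight of zero, since the logarithm in Equation 2 vanishes; terms confined to one email are weighted highest.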
We wanted to see the effect of each of these feature sets, separately as well as combined, on our clustering results. Consequently, we performed three different sets of clustering on our data. We also used two different kinds of clustering algorithms in our experiment: K-means and Expectation Maximization (EM). We used the Weka implementations of both algorithms [12]. We chose these two algorithms because we wanted to test different approaches to clustering: the partitional clustering approach (K-means) and a method (EM) for which we do not provide the clustering algorithm with the number of clusters to cluster the data into [13]. We therefore show three different clustering results for each dataset for each of the two algorithms.

3.5. Cluster Evaluation

Once we have the clusters, we evaluate them against ground truth that was collected manually. We calculate the purity of the overall clustering technique and present the results in the next section. Purity was calculated using Equation 4. We also analyze the clusters individually and present the result of the cluster that gave us the highest purity for each of the algorithms and each of the feature sets.

$$\text{Purity}\ (\%) = \frac{\#\ \text{of correctly clustered instances in each class}}{\text{total number of instances}} \tag{4}$$

While calculating the purity, we ignored clusters that were assigned very few instances in our experiments. We set a minimum threshold of 8%, i.e. at least 8% of the total emails should be in the cluster for us to consider it valid. We do this to avoid false alarms from singleton or small clusters claiming to be 100% pure.

3.6. Mapping Cluster to Domain Name / IP Address Retrieval

The IP addresses of the URLs embedded in the emails are fetched from the web links or domain names and stored in the IP table with the Message ID corresponding to each email in the clusters. The WHOIS information of an IP address can be looked up once the mapping has been done. WHOIS information retrieval is the process by which domain registration information is fetched. This can help identify web servers that host a large number of spam websites. However, one issue is that IP addresses have to be resolved in real time, because spam campaigns are generally active for a limited amount of time, after which the domain becomes inactive or gets blacklisted. When done in real time, this information helps cybercrime investigators take appropriate action against the spammers.
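The cluster evaluation step (Equation 4 together with the 8% minimum-size threshold) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `cluster_purity`, the choice of the majority ground-truth label as the "correct" class of a cluster, and the decision to compute the overall figure only over the clusters that pass the threshold are assumptions, since the text does not spell these details out.

```python
from collections import Counter, defaultdict


def cluster_purity(assignments, labels, min_fraction=0.08):
    """Return (overall purity in %, {cluster id: purity in %}).

    assignments[i] is the cluster id of email i and labels[i] its manually
    assigned ground-truth class. Clusters holding fewer than min_fraction of
    all emails are ignored, mirroring the 8% threshold described in the text.
    """
    total = len(assignments)
    members = defaultdict(list)
    for cluster_id, label in zip(assignments, labels):
        members[cluster_id].append(label)

    per_cluster = {}
    correct = kept = 0
    for cluster_id, cluster_labels in members.items():
        if len(cluster_labels) < min_fraction * total:
            continue  # too small to be trusted, even if 100% pure
        # Assumption: the majority class of a cluster counts as "correct".
        majority = Counter(cluster_labels).most_common(1)[0][1]
        per_cluster[cluster_id] = 100.0 * majority / len(cluster_labels)
        correct += majority
        kept += len(cluster_labels)

    overall = 100.0 * correct / kept if kept else 0.0
    return overall, per_cluster


# Hypothetical example: 20 emails in three clusters, one of them a singleton.
assignments = [0] * 10 + [1] * 9 + [2]
labels = ["A"] * 8 + ["B"] * 2 + ["B"] * 9 + ["C"]
overall, per_cluster = cluster_purity(assignments, labels)
best_cluster = max(per_cluster, key=per_cluster.get)
```

In this example the singleton cluster is dropped by the threshold, so it cannot be reported as the cluster with the highest purity despite being trivially pure.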
4. Results

In this section we present our results based on the three different types of clustering. Table 1 shows the clustering accuracy of the K-means algorithm on our dataset. We can see from the table that the purity of the combined feature set is better than that of the stylistic and the semantic features individually. Similarly, Table 2 shows the results of the EM algorithm. We can also see from columns 3, 6 and 9 of Tables 1 and 2 that, although the overall purity of the algorithms on any dataset is not high, the cluster that performed best still had a high purity.

Stylistic clustering gave good results when the email length was short (i.e. where the total word count was low). Emails marked by distinctive punctuation, such as sentences always ending with an exclamation mark (!) or question mark (?), or by a distinctive style are easy to identify using these features. Semantic clusters give good results when the semantic body is rich in content. Hence we can say that the length of the emails also affects the type of distinguishing features. For example, in the [...] emails dataset, stylistic clustering gave better results than semantic clustering because many of the emails in that dataset had little textual content, not enough to be distinguished by the semantic clustering approach. However, the combined feature set balances the effect of both.

From the tables, we can see that the data set of [...] emails produces better results than the [...] emails data set. This is for two reasons. Firstly, the additional emails that were added to the previous [...] emails mostly had more text content, which made it easier to extract meaningful data from them for clustering. Secondly, most of the emails in that set originally belonged to one cluster and had a similar pattern in their style of writing. For these two reasons, we also saw a vast improvement in the overall accuracy of the K-means algorithm in clustering that data set.

In the IP address mapping of the emails, we succeeded in getting results for only the last 1300 emails, because most of the domains of the previous 1300-email data set were inactive by then.
These emails mapped to 26 unique IP addresses, and we could get the WHOIS information for these addresses. The WHOIS information showed that the addresses were predominantly from countries such as the USA, China and Russia.

5. Conclusion

In this paper we present a stylistic and semantic way of clustering spam emails to identify spam of similar types. We experimented with data sets of different sizes and two different clustering algorithms and observed that K-means performed best. We can use this methodology to identify the writing style of each spam campaign from each spammer and build a prototype from it that can be used for future identification of spam emails of similar types. Emails in the same cluster generally point to the same campaign. Therefore clusters with not very [...] purities can point to the leading spam tendencies for the period.

In the future, we would like to experiment with various other features, such as subjects, sender ID, URLs in the emails, number and type of attachments, etc., that can help us further improve our accuracy. We would also like to perform a feature analysis for this task. It is very important to analyze which features give better output in this area, for the following reasons. Firstly, the amount of email data is vast, and feature extraction and clustering can take a long time. It will be useful to know which features are more important and can help save computational time. Secondly, we deal with real-time data, which means that we need something fast and general. Spam emails keep changing, and the technique used to cluster them needs to adapt to those changes. If we can come up with a set of features that give good results and work on different kinds of emails, we could help computer forensics experts get to the main [...].

Table 1.
Table showing the purity of clusters using the K-means algorithm on our data set
Columns: Data set, then for each of Stylistic Cluster, Semantic Cluster and Combined Cluster: # of clusters obtained, purity, purity of the cluster with the highest purity
[...]%  100%     [...]%  100%     [...]%  100%
[...]%  100%     [...]%  100%     [...]%  100%     70
[...]%  83.2%    [...]%  96.0%    [...]%  98.8%
[...]%  84.0%    [...]%  76.7%    [...]%  100%     1577

Table 2. Table showing the purity of clusters using the EM algorithm on our data set
Columns: Data set, then for each of Stylistic Cluster, Semantic Cluster and Combined Cluster: # of clusters obtained, purity, purity of the cluster with the highest purity
[...]%  56.1%    [...]%  90.0%    [...]%  96.1%
[...]%  83.3%    [...]%  63.0%    [...]%  60.2%
[...]%  98.8%    [...]%  91.0%    [...]%  96.6%
[...]%  99.0%    [...]%  89.0%    [...]%  79.0%

References

[1] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian Approach to Filtering Junk E-mail, In Proc. of AAAI-98 Workshop on Learning for Text Categorization, USA.
[2] C. Liu, Experiments on Spam Detection with Boosting, SVM and Naïve Bayes, UCSC.
[3] D. Debarr, H. Wechsler, Spam Detection using Clustering, Random Forests and Active Learning, In Proc. of the 6th Conf. on Email and Anti-Spam, CEAS, USA.
[4] F. Tzeng, K. Ma, Opening the Black Box: Data Driven Visualization of Neural Networks, Visualization, VIS 05, IEEE.
[5] F. Li, M. Hsieh, An Empirical Study of Clustering Behavior of Spammers and Group Based Anti Spam Strategies, In Proc. of the 3rd Conf. on Email and Anti-Spam, USA.
[6] C. Wei, A.P. Sprague, G. Warner, A. Skjellum, Clustering spam domains and targeting spam origin for forensic analysis, J. Digital Forensics, Security, and Law, Vol. 5.
[7] P. Guerra, D. Pires, D. Guedes, Spam Miner: A Platform for Detecting and Characterizing Spam Campaigns, In Proc. of the 6th Conf. on Email and Anti-Spam, CEAS, USA.
[8] Q. Zhang, H. Yang, Z. Yuan, J. Sun, Studies on the Semantic Body-Based Spam Filtering, In Proc. of the Intl. Conf. of Information Science and Management Engineering.
[9] S. Pillay, T. Solorio, Authorship Attribution of Web Forum Posts, In Proc. of the eCrime Researchers Summit, USA.
[10] C. Liu, S. Stamm, Fighting Unicode Obfuscated Spam, In Proc. of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, USA.
[11] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software: An update, SIGKDD Explorations, Volume 11, Issue 1.
[13] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (First Edition), Addison-Wesley Longman Publishing Co.
An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization
An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationEncoding Words into String Vectors for Word Categorization
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
More informationSTUDYING OF CLASSIFYING CHINESE SMS MESSAGES
STUDYING OF CLASSIFYING CHINESE SMS MESSAGES BASED ON BAYESIAN CLASSIFICATION 1 LI FENG, 2 LI JIGANG 1,2 Computer Science Department, DongHua University, Shanghai, China E-mail: 1 Lifeng@dhu.edu.cn, 2
More informationFighting Spam, Phishing and Malware With Recurrent Pattern Detection
Fighting Spam, Phishing and Malware With Recurrent Pattern Detection White Paper September 2017 www.cyren.com 1 White Paper September 2017 Fighting Spam, Phishing and Malware With Recurrent Pattern Detection
More informationFiltering Spam by Using Factors Hyperbolic Trees
Filtering Spam by Using Factors Hyperbolic Trees Hailong Hou*, Yan Chen, Raheem Beyah, Yan-Qing Zhang Department of Computer science Georgia State University P.O. Box 3994 Atlanta, GA 30302-3994, USA *Contact
More informationInternational ejournals
Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationPERSONALIZATION OF MESSAGES
PERSONALIZATION OF E-MAIL MESSAGES Arun Pandian 1, Balaji 2, Gowtham 3, Harinath 4, Hariharan 5 1,2,3,4 Student, Department of Computer Science and Engineering, TRP Engineering College,Tamilnadu, India
More informationIn this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.
December 13, 2006 IS256: Applied Natural Language Processing Final Project Email classification for semi-automated reply generation HANNES HESSE mail 2056 Emerson Street Berkeley, CA 94703 phone 1 (510)
More informationDetecting Spam Zombies By Monitoring Outgoing Messages
International Refereed Journal of Engineering and Science (IRJES) ISSN (Online) 2319-183X, (Print) 2319-1821 Volume 5, Issue 5 (May 2016), PP.71-75 Detecting Spam Zombies By Monitoring Outgoing Messages
More informationCollaborative Spam Mail Filtering Model Design
I.J. Education and Management Engineering, 2013, 2, 66-71 Published Online February 2013 in MECS (http://www.mecs-press.net) DOI: 10.5815/ijeme.2013.02.11 Available online at http://www.mecs-press.net/ijeme
More informationProject Report. Prepared for: Dr. Liwen Shih Prepared by: Joseph Hayes. April 17, 2008 Course Number: CSCI
University of Houston Clear Lake School of Science & Computer Engineering Project Report Prepared for: Dr. Liwen Shih Prepared by: Joseph Hayes April 17, 2008 Course Number: CSCI 5634.01 University of
More informationAutomated Online News Classification with Personalization
Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798
More informationAdvanced Spam Detection Methodology by the Neural Network Classifier
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 2, February 2014,
More informationAdvanced Filtering. Tobias Eggendorfer
Advanced Filtering Advanced Filtering Fails Too Overview Not so advanced Filtering Advanced Filtering Prevention Identification 2 Classic Filtering 3 Classic Filtering Black- & Whitelists 3 Classic Filtering
More informationA Novel Approach of Mining Write-Prints for Authorship Attribution in Forensics
DIGITAL FORENSIC RESEARCH CONFERENCE A Novel Approach of Mining Write-Prints for Authorship Attribution in E-mail Forensics By Farkhund Iqbal, Rachid Hadjidj, Benjamin Fung, Mourad Debbabi Presented At
More informationChapter-8. Conclusion and Future Scope
Chapter-8 Conclusion and Future Scope This thesis has addressed the problem of Spam E-mails. In this work a Framework has been proposed. The proposed framework consists of the three pillars which are Legislative
More informationContent Based Spam Filtering
2016 International Conference on Collaboration Technologies and Systems Content Based Spam E-mail Filtering 2nd Author Pingchuan Liu and Teng-Sheng Moh Department of Computer Science San Jose State University
More informationCHEAP, efficient and easy to use, has become an
Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013 A Multi-Resolution-Concentration Based Feature Construction Approach for Spam Filtering Guyue Mi,
More informationEffective Scheme for Reducing Spam in System
Effective Scheme for Reducing Spam in Email System 1 S. Venkatesh, 2 K. Geetha, 3 P. Manju Priya, 4 N. Metha Rani 1 Assistant Professor, 2,3,4 UG Scholar Department of Computer science and engineering
More informationA Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics
A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au
More informationData Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005
Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate
More informationCSI5387: Data Mining Project
CSI5387: Data Mining Project Terri Oda April 14, 2008 1 Introduction Web pages have become more like applications that documents. Not only do they provide dynamic content, they also allow users to play
More informationImproving the methods of classification based on words ontology
www.ijcsi.org 262 Improving the methods of email classification based on words ontology Foruzan Kiamarzpour 1, Rouhollah Dianat 2, Mohammad bahrani 3, Mehdi Sadeghzadeh 4 1 Department of Computer Engineering,
More informationInternational Journal of Advanced Research in Computer Science and Software Engineering
Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:
More informationNew Developments in the SpamPots Project
New Developments in the SpamPots Project Klaus Steding-Jessen Cristine Hoepers CERT.br CERT Brazil http://www.cert.br/ NIC.br Brazilian Network Information Center http://www.nic.br/
More informationKeywords : Bayesian, classification, tokens, text, probability, keywords. GJCST-C Classification: E.5
Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 13 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global
More informationMEASURING AND FINGERPRINTING CLICK-SPAM IN AD NETWORKS
MEASURING AND FINGERPRINTING CLICK-SPAM IN AD NETWORKS Vacha Dave *, Saikat Guha and Yin Zhang * * The University of Texas at Austin Microsoft Research India Internet Advertising Today 2 Online advertising
More informationIntroduction This paper will discuss the best practices for stopping the maximum amount of SPAM arriving in a user's inbox. It will outline simple
Table of Contents Introduction...2 Overview...3 Common techniques to identify SPAM...4 Greylisting...5 Dictionary Attack...5 Catchalls...5 From address...5 HELO / EHLO...6 SPF records...6 Detecting SPAM...6
More informationSpam Decisions on Gray using Personalized Ontologies
Spam Decisions on Gray E-mail using Personalized Ontologies Seongwook Youn Semantic Information Research Laboratory (http://sir-lab.usc.edu) Dept. of Computer Science Univ. of Southern California Los Angeles,
More informationBitDefender Antispam NeuNet
BitDefender Antispam NeuNet Whitepaper Cosoi Alexandru Catalin Researcher BitDefender AntiSpam Laboratory Contents 1. Overview of the Spam Issue 2. About Neural Networks 3. New Structure Using Neural Networks
More informationAutomatic New Topic Identification in Search Engine Transaction Log Using Goal Programming
Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log
More informationDiagnosis of Spams Some Statistical Considerations
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 3, Issue 4 (August 2012), PP. 05-09 Diagnosis of Email Spams Some Statistical Considerations
More informationString Vector based KNN for Text Categorization
458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research
More informationAn Empirical Study of Behavioral Characteristics of Spammers: Findings and Implications
An Empirical Study of Behavioral Characteristics of Spammers: Findings and Implications Zhenhai Duan, Kartik Gopalan, Xin Yuan Abstract In this paper we present a detailed study of the behavioral characteristics
Semi supervised clustering for Text Clustering
Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering
Phishing Activity Trends Report August, 2005
Phishing Activity Trends Report August, 2005 Phishing is a form of online identity theft that employs both social engineering and technical subterfuge to steal consumers' personal identity data and financial
Enhancing Clustering Results In Hierarchical Approach By Mvs Measures
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach
2. Design Methodology
Content-aware Email Multiclass Classification Categorize Emails According to Senders Liwei Wang, Li Du s Abstract People nowadays are overwhelmed by tons of coming emails everyday at work or in their daily
Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning
Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning Jing Ma 1, Wei Gao 2*, Kam-Fai Wong 1,3 1 The Chinese University of Hong Kong 2 Victoria University of Wellington, New Zealand
Chapter 8 The C 4.5*stat algorithm
109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the
The evolution of malevolence
Detection of spam hosts and spam bots using network traffic modeling Anestis Karasaridis Willa K. Ehrlich, Danielle Liu, David Hoeflin 4/27/2010. All rights reserved. AT&T and the AT&T logo are trademarks
Keywords Data alignment, Data annotation, Web database, Search Result Record
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web
News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages
Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---
Extraction of Web Image Information: Semantic or Visual Cues?
Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus
BOTNET-GENERATED SPAM
BOTNET-GENERATED SPAM By Areej Al-Bataineh University of Texas at San Antonio MIT Spam Conference 2009 www.securitycartoon.com 3/27/2009 Areej Al-Bataineh - Botnet-generated Spam 2 1 Botnets: A Global
A study of Video Response Spam Detection on YouTube
A study of Video Response Spam Detection on YouTube Suman 1 and Vipin Arora 2 1 Research Scholar, Department of CSE, BITS, Bhiwani, Haryana (India) 2 Asst. Prof., Department of CSE, BITS, Bhiwani, Haryana
An approach for Malicious Spam Detection In Email with comparison of different classifiers
An approach for Malicious Spam Detection In Email with comparison of different classifiers Umesh Kumar Sah 1,Narendra Parmar 2 1M.Tech Scholar, 2 Assistant Professor, 1,2 Sri Satya Sai College of Engineering,
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA
Accuracy Analysis of Neural Networks in removal of unsolicited e-mails
Accuracy Analysis of Neural Networks in removal of unsolicited e-mails P.Mohan Kumar P.Kumaresan S.Yokesh Babu Assistant Professor (Senior) Assistant Professor Assistant Professor (Senior) SITE SITE SCSE
An Overview of Concept Based and Advanced Text Clustering Methods.
An Overview of Concept Based and Advanced Text Clustering Methods. B.Jyothi, D.Sailaja, Dr.Y.Srinivasa Rao, GITAM, ANITS, GITAM, Asst.Professor Asst.Professor Professor Abstract: Most of the common techniques
Inferring User Search for Feedback Sessions
Inferring User Search for Feedback Sessions Sharayu Kakade 1, Prof. Ranjana Barde 2 PG Student, Department of Computer Science, MIT Academy of Engineering, Pune, MH, India 1 Assistant Professor, Department
Filtering Unwanted Messages from (OSN) User Wall s Using MLT
Filtering Unwanted Messages from (OSN) User Wall s Using MLT Prof.Sarika.N.Zaware 1, Anjiri Ambadkar 2, Nishigandha Bhor 3, Shiva Mamidi 4, Chetan Patil 5 1 Department of Computer Engineering, AISSMS IOIT,
Text Classification for E-mail Spam Using Naïve Bayesian Classifier
Text Classification for E-mail Spam Using Naïve Bayesian Classifier Priyanka Sao 1, Shilpi Chaubey 2, Sonali Katailiha 3 1,2,3 Assistant Professor, CSE Dept, Columbia Institute of Engg&Tech, Columbia Institute
Spam Classification Documentation
Spam Classification Documentation What is SPAM? Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient. Objective:
ISSN: 2321-7782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Comment Extraction from Blog Posts and Its Applications to Opinion Mining
Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan
Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets
IOP Conference Series: Materials Science and Engineering PAPER OPEN ACCESS Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets To cite this article: Nurul Fitriah Rusland
Review Spam Analysis using Term-Frequencies
Volume 03 - Issue 06 June 2018 PP. 132-140 Review Spam Analysis using Term-Frequencies Jyoti G.Biradar School of Mathematics and Computing Sciences Department of Computer Science Rani Channamma University
Developing Focused Crawlers for Genre Specific Search Engines
Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com
Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels
Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Richa Jain 1, Namrata Sharma 2 1M.Tech Scholar, Department of CSE, Sushila Devi Bansal College of Engineering, Indore (M.P.),
An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach
Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 15-17, 2007 71 An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach
Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.
Online Review Spam Detection by New Linguistic Features Amir Karami, University of Maryland Baltimore County Bin Zhou, University of Maryland Baltimore County Karami, A., Zhou, B. (2015). Online Review
Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman
Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information
Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008
Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008 Overview Introduction Countering Email Spam Problem Description Classification
Predictive Analysis: Evaluation and Experimentation. Heejun Kim
Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training
The importance of Whois data bases for spam enforcement
The importance of Whois data bases for spam enforcement Chris Fonteijn Chairman OPTA Joint meeting GAC/GNSO Marrakech, Monday 26 June 2006 1 Introduction My name is Chris Fonteijn and I am chairman of
3.5 SECURITY. How can you reduce the risk of getting a virus?
3.5 SECURITY 3.5.4 MALWARE WHAT IS MALWARE? Malware, short for malicious software, is any software used to disrupt the computer s operation, gather sensitive information without your knowledge, or gain
CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai
CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until
Improving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science
Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily
The Challenge of Spam An Internet Society Public Policy Briefing
The Challenge of Spam An Internet Society Public Policy Briefing 30 October 2015 Introduction Spam email, those unsolicited email messages we find cluttering our inboxes, are a challenge for Internet users,
A Content Vector Model for Text Classification
A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.
Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating
Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,
Why we spam? 1. To get Bank Logs by spamming different banks.
Hello guys this is tutorial in depth of the topic spamming. First of we will see what do we mean by term spamming. Wikipedia definition: Email spam, also known as unsolicited bulk Email (UBE), junk mail,
AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS
AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS Nilam B. Lonkar 1, Dinesh B. Hanchate 2 Student of Computer Engineering, Pune University VPKBIET, Baramati, India Computer Engineering, Pune University VPKBIET,
Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India
Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the
Domain name system black list false reporting attack
Domain name system black list false reporting attack Ing. Miloš Očkay, PhD 1, Ing. Martin Javurek 2, 1 Department of Informatics Armed Forces Academy of gen. M. R. Štefánik Liptovský Mikuláš, Slovakia
Method to Study and Analyze Fraud Ranking In Mobile Apps
Method to Study and Analyze Fraud Ranking In Mobile Apps Ms. Priyanka R. Patil M.Tech student Marri Laxman Reddy Institute of Technology & Management Hyderabad. Abstract: Ranking fraud in the mobile App
HOT ZONE IDENTIFICATION: ANALYZING EFFECTS OF DATA SAMPLING ON SPAM CLUSTERING
HOT ZONE IDENTIFICATION: ANALYZING EFFECTS OF DATA SAMPLING ON SPAM CLUSTERING Rasib Khan, Mainul Mizan, Ragib Hasan, Alan Sprague Department of Computer and Information Sciences University of Alabama
Non-ML Anti-Spamming: A Role Based Solution
Non-ML Anti-Spamming: A Role Based Solution Anthony Y. Fu, Email: anthony@cs.cityu.edu.hk WebPage: http://www.cs.cityu.edu.hk/~anthony Department of Computer Science, City University of Hong Kong Hong
Clustering Potential Phishing Websites Using DeepMD5 Abstract 1. Introduction
Clustering Potential Phishing Websites Using DeepMD5 Jason Britt, Brad Wardman, Dr. Alan Sprague, Gary Warner Department of Computer & Inf. Sciences University of Alabama at Birmingham Birmingham, AL 35294
Topic Classification in Social Media using Metadata from Hyperlinked Objects
Topic Classification in Social Media using Metadata from Hyperlinked Objects Sheila Kinsella 1, Alexandre Passant 1, and John G. Breslin 1,2 1 Digital Enterprise Research Institute, National University
International Journal Of Engineering Research & Management Technology
International Journal Of Engineering Research & Management Technology ISSN: 2348-4039 Email: editor@ijermt.org July- 2014 Volume 1, Issue-4 www.ijermt.org Document Clustering For Digital Devices: An Approach
Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
Classifying and Predicting Spam Messages Using Text Mining in SAS Enterprise Miner Session ID: 2650
Classifying and Predicting Spam Messages Using Text Mining in SAS Enterprise Miner Session ID: 2650 Mounika Kondamudi, Oklahoma State University Mentored by Balamurugan Mohan, H&R Block SAS and all other
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK ANALYSIS OF WEB USAGE MINING TECHNIQUES FOR WEB CRIME PATTERNS OF THE WEB USERS
ANDROID SHORT MESSAGES FILTERING FOR BAHASA USING MULTINOMIAL NAIVE BAYES
ANDROID SHORT MESSAGES FILTERING FOR BAHASA USING MULTINOMIAL NAIVE BAYES Shaufiah, Imanudin and Ibnu Asror Mining Center Laboratory, School of Computing, Telkom University, Bandung, Indonesia E-Mail:
An Improved Apriori Algorithm for Association Rules
Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan
Fighting the. Botnet Ecosystem. Renaud BIDOU. Page 1
Fighting the Botnet Ecosystem Renaud BIDOU Page 1 Bots, bots, bots Page 2 Botnet classification Internal Structure Command model Propagation mechanism 1. Monolithic Coherent, all features in one binary
An Efficient Clustering for Crime Analysis
An Efficient Clustering for Crime Analysis Malarvizhi S 1, Siddique Ibrahim 2 1 UG Scholar, Department of Computer Science and Engineering, Kumaraguru College Of Technology, Coimbatore, Tamilnadu, India
Weka (http://www.cs.waikato.ac.nz/ml/weka/)
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised
Schematizing a Global SPAM Indicative Probability
Schematizing a Global SPAM Indicative Probability NIKOLAOS KORFIATIS MARIOS POULOS SOZON PAPAVLASSOPOULOS Department of Management Science and Technology Athens University of Economics and Business Athens,
A platform for automatic identification of phishing URLs in mobile text messages
Journal of Physics: Conference Series PAPER OPEN ACCESS A platform for automatic identification of phishing URLs in mobile text messages To cite this article: Xiang Xun Sun et al 208 J. Phys.: Conf. Ser.
Reference Point Detection for Arch Type Fingerprints
Reference Point Detection for Arch Type Fingerprints H.K. Lam 1, Z. Hou 1, W.Y. Yau 1, T.P. Chen 1, J. Li 2, and K.Y. Sim 2 1 Computer Vision and Image Understanding Department Institute for Infocomm Research,
[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632
Knowledge Engineering in Search Engines
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:
Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar Youngstown State University Bonita Sharif, Youngstown State University
Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar, Youngstown State University Bonita Sharif, Youngstown State University Jenna Wise, Youngstown State University Alyssa Pawluk, Youngstown
Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach
Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India