Information Extraction from Spam Emails using Stylistic and Semantic Features to Identify Spammers

Soma Halder, University of Alabama at Birmingham
Richa Tiwari, University of Alabama at Birmingham
Alan Sprague, University of Alabama at Birmingham

Abstract

Traditional anti-spamming methods filter spam emails and prevent them from entering the inbox, but they take no measures to trace spammers and penalize them. This paper uses Natural Language Processing and Machine Learning techniques to cluster the spam emails from the same spammer based on the content and the style of the spam. Spam emails from different sources, collected over varied periods of time, are studied with two sets of features: stylistic and semantic. Three sets of clustering are performed: clustering based on stylistic features, clustering based on semantic features and clustering based on combined features. The clusters formed by different conventional clustering algorithms are compared and evaluated. Spam emails from the same source have similarities and cluster together. Spam emails contain URLs of the web pages that the spammer is trying to promote. Clusters are mapped to the Internet Protocol (IP) addresses of these URLs, and the WHOIS information of the IP addresses helps to identify the source of the spam.

Keywords: Spam; semantics; stylistics; machine learning; clustering

1. Introduction

Spam emails are a major security concern. Not only do they serve as a means of earning illegal profit by selling pharmaceutical products, replica watches, sexual enhancers and the like, but they also spread malware that either turns the receiver's computer into a bot or steals the user's credentials. Spam emails consume a huge amount of network bandwidth and computer memory and are one of the most important concerns of computer forensic experts. Various methods have been adopted to keep spam emails out of inboxes, but they are only temporary solutions to a grave problem. The most effective resistance against spam would be to track the spammers, or the masterminds behind such spam campaigns, and prosecute them.
The criminals, or spammers, stay connected to servers called command and control (C&C) servers. These servers are so called because they are based on the topology of command and control. The C&C servers are used to send spam emails carrying malware to a few computers; the malware, if executed, sets up the recipient's computer as a bot that can then be used to send spam. These bots remain hooked to the C&C server, which is operated by the primary spammer, or the "bad guys" (as shown in Figure 1). The primary sender updates information in the C&C server from time to time, and this is reflected on the victim's computer. This leads to the conclusion that spam originating from the same criminal will have the same style of writing in the email, irrespective of whatever bots are used to propagate it.

Data mining techniques, like clustering and classification, can be utilized to find similarities among these spam emails. A very important requirement for such approaches is choosing the correct set of features that can help in distinguishing the emails. In this work, we consider the text in the body and the style of the emails as the determining factors for detecting similarity and hence the source of the spam. Text from the body helps to determine the semantic attributes of the email, and the style of the emails is captured by stylistic features such as the use of punctuation, contractions, etc. Once we have the clustered spam, the IP addresses of the web links in the emails of a particular cluster can be looked up to find the agency hosting the site, and legal procedures can be initiated. Thus this approach of clustering spam emails not only groups spam with similar content and style but actually clusters emails from different campaigns under a single roof.

We divide this paper into four main sections: the second section gives a background study of the different spam prevention and detection techniques, the third section introduces our procedure and the last section analyzes and compares results for the different clustering algorithms used in the two different phases of clustering.

Figure 1. Diagram representing the distribution of the spam network

2. Related Work

The use of machine learning and statistical techniques for spam classification is a traditional approach. The oldest and most conventional anti-spamming technique, the Bayesian filtering method [1], uses the probability of recurring words to decide whether an email is spam or ham (legitimate). Other content-based spam filtering techniques include Support Vector Machines [2], Active Learning and Random Forests [3] and Neural Networks [4]. But spam filters only prevent spam from entering the mailboxes; they take no action to track the sender of the email. Spammers devise various obfuscation techniques to bypass these filters and continue to flood mailboxes with unsolicited emails. Therefore researchers today are more interested in studying the commonalities between the spamming tendencies of different spammers. They believe that it is more effective to eliminate the source of spam by taking legal action against the spammer than to just filter those emails. This paper is motivated by the same idea, and we use semantic and stylistic procedures similar to those used in the fields of authorship attribution and genre classification.

Li et al. [5], in their study of spam over a period of time, found that spam emails generally arrive in bunches and are similar to one another either in their prototype or in the URLs in them. The main conclusion of their research was that the different spam campaigns around the world are unified under a small group of spammers. Wei et al. [6] studied similar tendencies in spammer behavior and concluded that clustering similar emails together based on the subject of the email and on similarities in IP address can be used as an effective way to track spammers.
They studied the spamming pattern of 350,394 of the total of 638,678 emails that they collected over a period of one month spanning June and July. Their results were based on the emails that had subject similarity, and the top 5 daily clusters accounted for 83% of the total number of emails. Their top clusters had IP addresses that pointed to hosts in China and had as many as 3427 domain names registered under the same IP address.

Guerra & Pires [7] have been studying Brazilian spamming patterns for a few years, and they use a decision tree approach to find similarities in spam emails. Their Spam Miner places emails in the decision tree based on four main attributes: language, layout, web link in the email and message type. The frequencies of each of these attributes are computed to place the email in the right node. The depth of the tree decides the rate of recurrence of the message.

A recent technique in this area is the method proposed by Zhang and Yang [8]. They consider the text in the email body and compute word similarity and sentence similarity. Word similarity is computed based on the concept similarity of the words using HowNet (a knowledge base). Sentence similarity depends on the word similarity of the words that form the sentence. Similarity between semantic bodies depends on the concept vector space formed by the combination of words and sentences. Whenever a spam email is received, it is grouped according to the concept in the content of the message.

Semantic and stylistic spam clustering has similarities with the authorship attribution techniques used on web forums. The work in [9] uses syntactic, semantic and stylistic classification to distinguish posts written by the same author. The method gave an accuracy of about 90% and proved efficient for text of considerable length. Our method is similar to [9], but one challenge we face is that email contents are often too short; this makes the task more difficult and often decreases the level of accuracy that can be achieved compared to documents of considerable length.

3. Methodology

As mentioned earlier, we consider clustering emails based on the stylistic and semantic features of the emails. In this work, our methodology can be divided into seven steps, as shown in Figure 2.

3.1. Data Collection

For this work we collected the data from the UAB spam database. The database collects data from different recipients after they report a particular email as spam. The database also contains data collected directly from mail servers that have catch-all accounts. Whenever a mail server receives an email that has been sent to a non-existent email account, the email is forwarded to a default account [10] called the catch-all account. We assume that any emails reaching this account are spam. The initial dataset has approximately 10,000 spam emails containing spam of all categories.

Figure 2. Block diagram representing different stages of our methodology

3.2. Preprocessing

This stage primarily involves cleaning the data. Here we removed all emails that were in any language other than English. Emails that had only attachments or web links (URLs) in them were also removed. We kept all emails that had at least one line of text with more than 4 words. Preprocessing and data cleaning left us with roughly 25% of the original data collected. We divided this into four data sets of different sizes (200, 700 and 1300 emails, plus one further set) and performed clustering on them.

3.3. Feature Extraction

This is a very important step in any clustering process. Here we try to identify the features that can help in clustering similar documents together. In our case, we would like to cluster spam emails based on the style and content of the emails. Hence we divide our features into two main categories, the stylistic and the semantic features, as described below.

3.3.1. Stylistic features. Stylistic features are based on the style in which the email is composed, as emails generated from the same botnet or written by the same spammer should have similarities in writing style. The main idea behind propagating a spam campaign is to persuade the users to buy a product or to infect them with malware. For this purpose these emails invariably have a URL or email id injected in the body, and this can be used as a stylistic feature of the email. Obfuscation, or deliberate misspelling of words (for example, "hello" written as "he110"), is a common practice adopted by spammers to bypass the filters [10]; hence the number of obfuscated or alphanumeric words is also a good feature. We consider a list of 57 stylistic features.
The features are:
1. The total word count of the text in the email
2. The number of new lines in the email
3. The total count of the punctuation used in the email body
4. The total number of contractions in the email
5. The total number of obfuscated words in the email
6. The total number of email ids in the email
7. The total number of URLs in the body of the email
8. The count of different punctuation marks used in the email. We prepared a list of 50 different punctuation marks and calculated the frequency of appearance of each of them in the email body. Features 1 to 7 together with these 50 frequencies give the 57 stylistic features.

3.3.2. Semantic features. By semantic features, we refer to the features that give us insight into the content or semantic meaning of the emails. We have used two classes of semantic features: the Tf-Idf (term frequency-inverse document frequency) of the top x most frequent words used in our dataset and the count of the top x bigrams used in the dataset, where x is decided by a cutoff on the minimum frequency count. In our case the value of x varied between experiments.

Tf-Idf is a statistical measure that can be used to represent the importance of a term in a document [11]. Before computing Tf-Idf we first remove all the stop words from the emails. Stop words are words, such as articles, which do not add any meaning to the semantics of a document. It is common practice to remove stop words from the text before processing. Once the stop words are removed, Tf-Idf can be calculated in three steps. Equation 1 calculates the term frequency (tf) of each term in a given document.

$$ tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1) $$

where tf_{i,j} is the term frequency of the term i in document j, n_{i,j} is the number of times the term i occurred in the document j, and \sum_k n_{k,j} is the sum of the number of occurrences of all the terms in the document j; k ranges over the terms in document j.

Equation 2 gives the general importance of a term in the whole corpus by dividing the total number of documents by the number of documents containing the term. idf stands for inverse document frequency, D is the total number of documents in the data set and the denominator of Equation 2 is the number of documents in which the term t_i appears.

$$ idf_{i} = \log \frac{D}{|\{d : t_i \in d\}|} \qquad (2) $$

Finally, Tf-Idf is the product of the results obtained in Equations 1 and 2 and is given by Equation 3.

$$ (tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_{i} \qquad (3) $$

A bigram is any two words that occur consecutively in a document. Knowing the most frequent bigrams used in a document helps us to do a better analysis of its content. For our second class of semantic features, we first prepared a corpus of the top x most frequently used bigrams from the whole data set and then made a feature vector that keeps the count of those bigram occurrences for each of the documents. Here x is the cutoff value based on a pre-decided minimum frequency over all bigrams. Again, x varied between experiments in our case.

3.3.3. Combined features. We take a combination of both the stylistic features and the semantic features. Once we have these features extracted for each document in our data set, we proceed to the clustering step.

3.4. Clustering

As mentioned in the previous step, we extracted stylistic and semantic features separately for the spam emails.
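To make the feature definitions concrete, the following is a minimal Python sketch of a few of the stylistic counts and of the Tf-Idf weighting in Equations 1 to 3. It is an illustration under stated assumptions, not the authors' implementation: the stop-word list, the regular expressions, the short punctuation set, the use of the natural logarithm and all function names are hypothetical choices made for this example. The bigram counts and the full list of 50 punctuation marks are not included.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (assumption: the paper does not give its list).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def stylistic_features(body):
    """Return a handful of stylistic counts for one email body."""
    text = URL_RE.sub(" ", body)  # count words outside of URLs only
    words = WORD_RE.findall(text)
    return {
        "word_count": len(words),
        "new_lines": body.count("\n"),
        # Small punctuation set for illustration, not the paper's 50 marks.
        "punctuation": sum(ch in "!?.,;:" for ch in text),
        "contractions": sum("'" in w.strip("'") for w in words),
        # Words mixing letters and digits (e.g. "he110") count as obfuscated.
        "obfuscated": sum(
            bool(re.search(r"[A-Za-z]", w) and re.search(r"\d", w)) for w in words
        ),
        "urls": len(URL_RE.findall(body)),
    }


def tokenize(body):
    """Lower-case word tokens with URLs and stop words removed."""
    words = WORD_RE.findall(URL_RE.sub(" ", body.lower()))
    return [w for w in words if w not in STOP_WORDS]


def tf_idf(documents):
    """Tf-Idf per Equations 1-3 for a list of token lists (natural log assumed)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # number of documents containing each term
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())  # denominator of Equation 1
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights


emails = [
    "Buy che4p watches now!\nVisit http://example.com/watches",
    "Don't miss replica watches, buy today!",
]
style = stylistic_features(emails[0])
docs = [tokenize(e) for e in emails]
weights = tf_idf(docs)
print(style)
print(weights[0])
```

A term that occurs in every document, such as "buy" in this toy corpus, receives a weight of zero, which is the behavior Equation 2 is meant to produce. A combined feature vector would simply concatenate the stylistic counts with the Tf-Idf weights of the selected top-x terms.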
We wanted to see the effect of each of these feature sets, separately as well as combined, on our clustering results. Consequently, we performed three different sets of clustering on our data. We also used two different kinds of clustering algorithms in our experiment: K-means and Expectation Maximization (EM). In this work we used the Weka implementations of these algorithms [12]. We chose these two algorithms because we wanted to test different approaches to clustering: the partitioning approach (K-means) and the EM approach, in which we do not provide the clustering algorithm with the number of clusters to cluster the data into [13]. Therefore, we show three different clustering results for each dataset for each of the two algorithms.

3.5. Cluster Evaluation

Once we have the clusters, we evaluate them against ground truth obtained from manually collected data. We calculate the purity of the overall clustering technique and present the results in the next section. Purity was calculated using Equation 4. We also analyze the clusters individually and present the result of the cluster that gave us the highest accuracy for each of the algorithms and each of the feature sets.

$$ \text{Purity}\,(\%) = \frac{\#\text{ of correctly clustered instances in each class}}{\text{total number of instances}} \times 100 \qquad (4) $$

While calculating the purity, we ignored those clusters that were assigned very few instances in our experiments. We used a minimum threshold of 8%, i.e. at least 8% of the total emails should be in a cluster for us to consider it valid. We do this because we want to avoid false alarms from singleton or small clusters that appear to be 100% pure.

3.6. Mapping Cluster to Domain Name / IP address retrieval

The IP addresses of the URLs embedded in the emails are fetched from the web links or domain names and stored in an IP table, together with the Message ID of each email in the clusters. The WHOIS information of an IP address can be looked up once the mapping has been done. WHOIS information retrieval is the process by which domain registration information is fetched. This can help identify web servers that host large numbers of spam websites.
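The mapping from clusters to hosts can be sketched as follows. This is a hedged illustration only: the dictionaries `cluster_of` and `bodies`, keyed by Message ID, are hypothetical structures, and the URL pattern is a simplification. The `resolve` helper uses the standard library's `socket.gethostbyname` and is defined but not called here, since it needs network access and a domain that is still live. The WHOIS lookup itself is a separate step that is not shown.

```python
import re
import socket
from collections import defaultdict
from urllib.parse import urlparse

# Simplified URL pattern (assumption); real spam needs more careful parsing.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def hosts_in_email(body):
    """Return the set of host names found in the URLs of one email body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL, skip it
            continue
        if host:
            hosts.add(host.lower())
    return hosts


def hosts_per_cluster(cluster_of, bodies):
    """Map each cluster label to the hosts advertised by its emails.

    cluster_of: dict of Message ID -> cluster label (hypothetical structure)
    bodies:     dict of Message ID -> email body text
    """
    table = defaultdict(set)
    for message_id, label in cluster_of.items():
        table[label] |= hosts_in_email(bodies[message_id])
    return dict(table)


def resolve(host):
    """Resolve a host name to an IPv4 address, or None if it no longer resolves."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


bodies = {
    "m1": "Cheap watches at http://shop.example.com/deal now",
    "m2": "Visit http://shop.example.com/new and http://pills.example.org/",
    "m3": "No links here",
}
cluster_of = {"m1": 0, "m2": 0, "m3": 1}
cluster_hosts = hosts_per_cluster(cluster_of, bodies)
print(cluster_hosts)
```

Grouping hosts per cluster before resolving them keeps the number of lookups small, and a `None` result from `resolve` marks a domain that has already gone inactive.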
However, one issue with finding IP addresses is that they have to be found in real time, because spam campaigns are generally active for a limited amount of time, after which the domain becomes inactive or gets blacklisted. When done in real time, this information helps cybercrime investigators, who can then take appropriate action against the spammers.
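Before turning to the results, the purity measure of Equation 4 and the 8% minimum cluster size can be sketched in Python. This is a minimal sketch under assumptions: a cluster's correctly clustered instances are taken to be those carrying its majority ground-truth label, and the overall purity is computed only over emails in clusters that pass the threshold. The paper does not spell out either detail, and the function name and argument layout are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(assignments, truth, min_share=0.08):
    """Return (overall purity %, {cluster: purity %}) for sufficiently large clusters.

    assignments: cluster label per email
    truth:       ground-truth label per email, in the same order
    min_share:   minimum fraction of all emails a cluster must hold to be valid
    """
    members = defaultdict(list)
    for cluster, label in zip(assignments, truth):
        members[cluster].append(label)

    total = len(assignments)
    per_cluster = {}
    correct = kept = 0
    for cluster, labels in members.items():
        if len(labels) < min_share * total:
            continue  # singleton or tiny cluster: ignored as a false alarm
        majority = Counter(labels).most_common(1)[0][1]
        per_cluster[cluster] = 100.0 * majority / len(labels)
        correct += majority
        kept += len(labels)
    overall = 100.0 * correct / kept if kept else 0.0
    return overall, per_cluster


# Toy example: 20 emails in three clusters, the third being a singleton.
assignments = [0] * 10 + [1] * 9 + [2] * 1
truth = ["A"] * 8 + ["B"] * 2 + ["B"] * 9 + ["C"]
overall, per_cluster = cluster_purity(assignments, truth)
print(round(overall, 1), per_cluster)
```

In this toy example the singleton cluster would be 100% pure, but it holds less than 8% of the emails and is therefore left out, which is the false alarm the threshold is meant to avoid.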

4. Results

In this section we present our results for the three different types of clustering. Table 1 shows the clustering accuracy of the K-means algorithm on our dataset. We can see from the table that the purity of the combined feature set is better than that of the stylistic and the semantic features individually. Similarly, Table 2 shows the results of the EM algorithm. We can also see from columns 3, 6 and 9 of Tables 1 and 2 that, although the overall purity of the algorithms on any dataset is not high, the cluster that performed the best still had a high purity.

Stylistic clustering gave good results when the email length was short (i.e. where the total word count was low). Emails marked by distinguishing punctuation or style, such as sentences always ending with an exclamation mark (!) or question mark (?), are easy to identify using these features. Semantic clusters give good results when the semantic body is rich in content. Hence we can say that the length of the emails also affects which type of feature is distinguishing. For example, in one of the data sets stylistic clustering gave better results than semantic clustering because many of the emails in that dataset were short in textual content and could not be distinguished by the semantic clustering approach. However, the combined approach balances the effect of both.

From the tables, we can see that one extended data set produces better results than the data set it was built from. This is because of the following two reasons. Firstly, the additional emails that were added to the previous emails mostly had more text content, and this made it easier to extract meaningful data from them for clustering. Secondly, most of the emails in that set originally belonged to one cluster and had a similar pattern in their style of writing. Because of these two reasons, we also saw a vast improvement in the overall accuracy of the K-means algorithm in clustering that data set.

On doing IP address mapping of the clusters, we obtained results only for the last set of 1300 emails, because most of the domains of the previous 1300 data set were inactive by then.
These clusters mapped to 26 unique IP addresses, and we could get the WHOIS information for these addresses. The WHOIS information showed that the addresses were predominantly from countries such as the USA, China and Russia.

5. Conclusion

In this paper we present a stylistic and semantic way of clustering spam emails to identify spam of similar types. We experimented with a number of data sets and two different clustering algorithms and observed that K-means performed the best. This methodology can be used to identify the writing style of each spam campaign from each spammer and to build a prototype from it for future identification of spam emails of similar types. Emails in the same cluster generally point to the same campaign. Therefore the clusters obtained, depending on their purity, can point to the leading spam tendencies for the period.

In the future, we would like to experiment with various other features, such as email subjects, sender ID, URLs in the emails, number and type of attachments, etc., that can help us further improve our accuracy. We would also like to perform a feature analysis for this task. It is very important to analyze which features give better output in this area, for the following reasons. Firstly, the amount of email data is vast, and feature extraction and clustering can take a long time. It will be useful to know which features are more important and can help save computational time. Secondly, we deal with real-time data, which means that we need something that is fast and generalized. Spam emails keep changing, and the technique to cluster them needs to be able to adapt to those changes. If we can come up with a set of features that gives good results and works on different kinds of emails, we could help computer forensics experts to get to the main sources of spam.

Table 1. Purity of clusters using the K-means algorithm on our data set. Rows follow the order of the source table. Only the purity of the cluster with the highest purity is legible for each feature set; the data set sizes, numbers of clusters and overall purity values are not recoverable.

Row   Stylistic Cluster   Semantic Cluster   Combined Cluster
1     100%                100%               100%
2     100%                100%               100%
3     83.2%               96.0%              98.8%
4     84.0%               76.7%              100%

Table 2. Purity of clusters using the EM algorithm on our data set. Rows follow the order of the source table. Only the purity of the cluster with the highest purity is legible for each feature set; the data set sizes, numbers of clusters and overall purity values are not recoverable.

Row   Stylistic Cluster   Semantic Cluster   Combined Cluster
1     56.1%               90.0%              96.1%
2     83.3%               63.0%              60.2%
3     98.8%               91.0%              96.6%
4     99.0%               89.0%              79.0%

References

[1] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian Approach to Filtering Junk E-Mail. In Proc. of AAAI-98 Workshop on Learning for Text Categorization, USA.
[2] C. Liu, Experiments on Spam Detection with Boosting, SVM and Naïve Bayes, UCSC.
[3] D. Debarr, H. Wechsler, Spam Detection using Clustering, Random Forests and Active Learning. In Proc. of the 6th Conf. on Email and Anti-Spam, CEAS, USA.
[4] F. Tzeng, K. Ma, Opening the Black Box: Data Driven Visualization of Neural Networks. Visualization, VIS 05, IEEE.
[5] F. Li, M. Hsieh, An Empirical Study of Clustering Behavior of Spammers and Group Based Anti-Spam Strategies. In Proc. of the 3rd Conf. on Email and Anti-Spam, USA.
[6] C. Wei, A.P. Sprague, G. Warner, A. Skjellum, Clustering spam domains and targeting spam origin for forensic analysis, J. Digital Forensics, Security, and Law, Vol. 5.
[7] P. Guerra, D. Pires, D. Guedes, Spam Miner: A Platform for Detecting and Characterizing Spam Campaigns. In Proc. of the 6th Conf. on Email and Anti-Spam, CEAS, USA.
[8] Q. Zhang, H. Yang, Z. Yuan, J. Sun, Studies on the Semantic Body-Based Spam Filtering. In Proc. of the Intl. Conf. of Information Science and Management Engineering.
[9] S. Pillay, T. Solorio, Authorship Attribution of Web Forum Posts. In Proc. of the eCrime Researchers Summit, USA.
[10] C. Liu, S. Stamm, Fighting Unicode Obfuscated Spam. In Proc. of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, USA.
[11] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software: An update, SIGKDD Explorations, Volume 11, Issue 1.
[13] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (First Edition), Addison-Wesley Longman Publishing Co.


More information

An Empirical Study of Behavioral Characteristics of Spammers: Findings and Implications

An Empirical Study of Behavioral Characteristics of Spammers: Findings and Implications An Empirical Study of Behavioral Characteristics of Spammers: Findings and Implications Zhenhai Duan, Kartik Gopalan, Xin Yuan Abstract In this paper we present a detailed study of the behavioral characteristics

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

Phishing Activity Trends Report August, 2005

Phishing Activity Trends Report August, 2005 Phishing Activity Trends Report August, 25 Phishing is a form of online identity theft that employs both social engineering and technical subterfuge to steal consumers' personal identity data and financial

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

2. Design Methodology

2. Design Methodology Content-aware Email Multiclass Classification Categorize Emails According to Senders Liwei Wang, Li Du s Abstract People nowadays are overwhelmed by tons of coming emails everyday at work or in their daily

More information

Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning

Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning Jing Ma 1, Wei Gao 2*, Kam-Fai Wong 1,3 1 The Chinese University of Hong Kong 2 Victoria University of Wellington, New Zealand

More information

Chapter 8 The C 4.5*stat algorithm

Chapter 8 The C 4.5*stat algorithm 109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the

More information

The evolution of malevolence

The evolution of malevolence Detection of spam hosts and spam bots using network traffic modeling Anestis Karasaridis Willa K. Ehrlich, Danielle Liu, David Hoeflin 4/27/2010. All rights reserved. AT&T and the AT&T logo are trademarks

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---

More information

Extraction of Web Image Information: Semantic or Visual Cues?

Extraction of Web Image Information: Semantic or Visual Cues? Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus

More information

BOTNET-GENERATED SPAM

BOTNET-GENERATED SPAM BOTNET-GENERATED SPAM By Areej Al-Bataineh University of Texas at San Antonio MIT Spam Conference 2009 www.securitycartoon.com 3/27/2009 Areej Al-Bataineh - Botnet-generated Spam 2 1 Botnets: A Global

More information

A study of Video Response Spam Detection on YouTube

A study of Video Response Spam Detection on YouTube A study of Video Response Spam Detection on YouTube Suman 1 and Vipin Arora 2 1 Research Scholar, Department of CSE, BITS, Bhiwani, Haryana (India) 2 Asst. Prof., Department of CSE, BITS, Bhiwani, Haryana

More information

An approach for Malicious Spam Detection In with comparison of different classifiers

An approach for Malicious Spam Detection In  with comparison of different classifiers An approach for Malicious Spam Detection In Email with comparison of different classifiers Umesh Kumar Sah 1,Narendra Parmar 2 1M.Tech Scholar, 2 Assistant Professor, 1,2 Sri Satya Sai College of Engineering,

More information

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

More information

Accuracy Analysis of Neural Networks in removal of unsolicited s

Accuracy Analysis of Neural Networks in removal of unsolicited  s Accuracy Analysis of Neural Networks in removal of unsolicited e-mails P.Mohan Kumar P.Kumaresan S.Yokesh Babu Assistant Professor (Senior) Assistant Professor Assistant Professor (Senior) SITE SITE SCSE

More information

An Overview of Concept Based and Advanced Text Clustering Methods.

An Overview of Concept Based and Advanced Text Clustering Methods. An Overview of Concept Based and Advanced Text Clustering Methods. B.Jyothi, D.Sailaja, Dr.Y.Srinivasa Rao, GITAM, ANITS, GITAM, Asst.Professor Asst.Professor Professor Abstract: Most of the common techniques

More information

Inferring User Search for Feedback Sessions

Inferring User Search for Feedback Sessions Inferring User Search for Feedback Sessions Sharayu Kakade 1, Prof. Ranjana Barde 2 PG Student, Department of Computer Science, MIT Academy of Engineering, Pune, MH, India 1 Assistant Professor, Department

More information

Filtering Unwanted Messages from (OSN) User Wall s Using MLT

Filtering Unwanted Messages from (OSN) User Wall s Using MLT Filtering Unwanted Messages from (OSN) User Wall s Using MLT Prof.Sarika.N.Zaware 1, Anjiri Ambadkar 2, Nishigandha Bhor 3, Shiva Mamidi 4, Chetan Patil 5 1 Department of Computer Engineering, AISSMS IOIT,

More information

Text Classification for Spam Using Naïve Bayesian Classifier

Text Classification for  Spam Using Naïve Bayesian Classifier Text Classification for E-mail Spam Using Naïve Bayesian Classifier Priyanka Sao 1, Shilpi Chaubey 2, Sonali Katailiha 3 1,2,3 Assistant ProfessorCSE Dept, Columbia Institute of Engg&Tech, Columbia Institute

More information

Spam Classification Documentation

Spam Classification Documentation Spam Classification Documentation What is SPAM? Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient. Objective:

More information

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Analysis of Naïve Bayes Algorithm for Spam Filtering across Multiple Datasets

Analysis of Naïve Bayes Algorithm for  Spam Filtering across Multiple Datasets IOP Conference Series: Materials Science and Engineering PAPER OPEN ACCESS Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets To cite this article: Nurul Fitriah Rusland

More information

Review Spam Analysis using Term-Frequencies

Review Spam Analysis using Term-Frequencies Volume 03 - Issue 06 June 2018 PP. 132-140 Review Spam Analysis using Term-Frequencies Jyoti G.Biradar School of Mathematics and Computing Sciences Department of Computer Science Rani Channamma University

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Richa Jain 1, Namrata Sharma 2 1M.Tech Scholar, Department of CSE, Sushila Devi Bansal College of Engineering, Indore (M.P.),

More information

An Automatic Reply to Customers Queries Model with Chinese Text Mining Approach

An Automatic Reply to Customers  Queries Model with Chinese Text Mining Approach Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 15-17, 2007 71 An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach

More information

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings. Online Review Spam Detection by New Linguistic Features Amir Karam, University of Maryland Baltimore County Bin Zhou, University of Maryland Baltimore County Karami, A., Zhou, B. (2015). Online Review

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008 Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008 Overview Introduction Countering Email Spam Problem Description Classification

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

The importance of Whois data bases for spam enforcement

The importance of Whois data bases for spam enforcement The importance of Whois data bases for spam enforcement Chris Fonteijn Chairman OPTA Joint meeting GAC/GNSO Marrakech, Monday 26 June 2006 1 Introduction My name is Chris Fonteijn and I am chairman of

More information

3.5 SECURITY. How can you reduce the risk of getting a virus?

3.5 SECURITY. How can you reduce the risk of getting a virus? 3.5 SECURITY 3.5.4 MALWARE WHAT IS MALWARE? Malware, short for malicious software, is any software used to disrupt the computer s operation, gather sensitive information without your knowledge, or gain

More information

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily

More information

The Challenge of Spam An Internet Society Public Policy Briefing

The Challenge of Spam An Internet Society Public Policy Briefing The Challenge of Spam An Internet Society Public Policy Briefing 30 October 2015 Introduction Spam email, those unsolicited email messages we find cluttering our inboxes, are a challenge for Internet users,

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

Why we spam? 1. To get Bank Logs by spamming different banks.

Why we spam? 1. To get Bank Logs by spamming different banks. Hello guys this is tutorial in depth of the topic spamming. First of we will see what do we mean by term spamming. Wikipedia definition: Email spam, also known as unsolicited bulk Email (UBE), junk mail,

More information

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS Nilam B. Lonkar 1, Dinesh B. Hanchate 2 Student of Computer Engineering, Pune University VPKBIET, Baramati, India Computer Engineering, Pune University VPKBIET,

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

Domain name system black list false reporting attack

Domain name system black list false reporting attack Domain name system black list false reporting attack Ing. Miloš Očkay, PhD 1, Ing. Martin Javurek 2, 1 Department of Informatics Armed Forces Academy of gen. M. R. Štefánik Liptovský Mikuláš, Slovakia

More information

Method to Study and Analyze Fraud Ranking In Mobile Apps

Method to Study and Analyze Fraud Ranking In Mobile Apps Method to Study and Analyze Fraud Ranking In Mobile Apps Ms. Priyanka R. Patil M.Tech student Marri Laxman Reddy Institute of Technology & Management Hyderabad. Abstract: Ranking fraud in the mobile App

More information

HOT ZONE IDENTIFICATION: ANALYZING EFFECTS OF DATA SAMPLING ON SPAM CLUSTERING

HOT ZONE IDENTIFICATION: ANALYZING EFFECTS OF DATA SAMPLING ON SPAM CLUSTERING HOT ZONE IDENTIFICATION: ANALYZING EFFECTS OF DATA SAMPLING ON SPAM CLUSTERING Rasib Khan, Mainul Mizan, Ragib Hasan, Alan Sprague Department of Computer and Information Sciences University of Alabama

More information

Non-ML Anti-Spamming: A Role Based Solution

Non-ML Anti-Spamming: A Role Based Solution Non-ML Anti-Spamming: A Role Based Solution Anthony Y. Fu, Email: anthony@cs.cityu.edu.hk WebPage: http://www.cs.cityu.edu.hk/~anthony Department of Computer Science, City University of Hong Kong Hong

More information

Clustering Potential Phishing Websites Using DeepMD5 Abstract 1. Introduction

Clustering Potential Phishing Websites Using DeepMD5 Abstract 1. Introduction Clustering Potential Phishing Websites Using DeepMD5 Jason Britt, Brad Wardman, Dr. Alan Sprague, Gary Warner Department of Computer & Inf. Sciences University of Alabama at Birmingham Birmingham, AL 35294

More information

Topic Classification in Social Media using Metadata from Hyperlinked Objects

Topic Classification in Social Media using Metadata from Hyperlinked Objects Topic Classification in Social Media using Metadata from Hyperlinked Objects Sheila Kinsella 1, Alexandre Passant 1, and John G. Breslin 1,2 1 Digital Enterprise Research Institute, National University

More information

International Journal Of Engineering Research & Management Technology

International Journal Of Engineering Research & Management Technology International Journal Of Engineering Research & Management Technology ISSN: 2348-4039 Email: editor@ijermt.org July- 2014 Volume 1, Issue-4 www.ijermt.org Document Clustering For Digital Devices: An Approach

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Classifying and Predicting Spam Messages Using Text Mining in SAS Enterprise Miner Session ID: 2650

Classifying and Predicting Spam Messages Using Text Mining in SAS Enterprise Miner Session ID: 2650 Classifying and Predicting Spam Messages Using Text Mining in SAS Enterprise Miner Session ID: 2650 Mounika Kondamudi, Oklahoma State University Mentored by Balamurugan Mohan, H&R Block SAS and all other

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK ANALYSIS OF WEB USAGE MINING TECHNIQUES FOR WEB CRIME PATTERNS OF THE WEB USERS

More information

ANDROID SHORT MESSAGES FILTERING FOR BAHASA USING MULTINOMIAL NAIVE BAYES

ANDROID SHORT MESSAGES FILTERING FOR BAHASA USING MULTINOMIAL NAIVE BAYES ANDROID SHORT MESSAGES FILTERING FOR BAHASA USING MULTINOMIAL NAIVE BAYES Shaufiah, Imanudin and Ibnu Asror Mining Center Laboratory, School of Computing, Telkom University, Bandung, Indonesia E-Mail:

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Fighting the. Botnet Ecosystem. Renaud BIDOU. Page 1

Fighting the. Botnet Ecosystem. Renaud BIDOU. Page 1 Fighting the Botnet Ecosystem Renaud BIDOU Page 1 Bots, bots, bots Page 2 Botnet classification Internal Structure Command model Propagation mechanism 1. Monolithic Coherent, all features in one binary

More information

An Efficient Clustering for Crime Analysis

An Efficient Clustering for Crime Analysis An Efficient Clustering for Crime Analysis Malarvizhi S 1, Siddique Ibrahim 2 1 UG Scholar, Department of Computer Science and Engineering, Kumaraguru College Of Technology, Coimbatore, Tamilnadu, India

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Schematizing a Global SPAM Indicative Probability

Schematizing a Global SPAM Indicative Probability Schematizing a Global SPAM Indicative Probability NIKOLAOS KORFIATIS MARIOS POULOS SOZON PAPAVLASSOPOULOS Department of Management Science and Technology Athens University of Economics and Business Athens,

More information

A platform for automatic identification of phishing URLs in mobile text messages

A platform for automatic identification of phishing URLs in mobile text messages Journal of Physics: Conference Series PAPER OPEN ACCESS A platform for automatic identification of phishing URLs in mobile text messages To cite this article: Xiang Xun Sun et al 208 J. Phys.: Conf. Ser.

More information

Reference Point Detection for Arch Type Fingerprints

Reference Point Detection for Arch Type Fingerprints Reference Point Detection for Arch Type Fingerprints H.K. Lam 1, Z. Hou 1, W.Y. Yau 1, T.P. Chen 1, J. Li 2, and K.Y. Sim 2 1 Computer Vision and Image Understanding Department Institute for Infocomm Research,

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Knowledge Engineering in Search Engines

Knowledge Engineering in Search Engines San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:

More information

Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar Youngstown State University Bonita Sharif, Youngstown State University

Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar Youngstown State University Bonita Sharif, Youngstown State University Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar, Youngstown State University Bonita Sharif, Youngstown State University Jenna Wise, Youngstown State University Alyssa Pawluk, Youngstown

More information

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India

More information