A Reputation-based Collaborative Approach for Spam Filtering

Similar documents
Mailspike. Henrique Aparício

Collaborative Spam Mail Filtering Model Design

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

Schematizing a Global SPAM Indicative Probability

Computer aided mail filtering using SVM

CPSC156a: The Internet Co-Evolution of Technology and Society

Untitled Page. Help Documentation

SPAM PRECAUTIONS: A SURVEY

Hit the Ground Spam(fight)ing

Keywords : Bayesian, classification, tokens, text, probability, keywords. GJCST-C Classification: E.5

Content Based Spam Filtering

Smart elab volume 3, anno 2014 Networking. Setup of a clustered antispam and antivirus service based on Mailcleaner suite

Handling unwanted . What are the main sources of junk ?

Bayesian Spam Filtering Using Statistical Data Compression

Ethical Hacking and. Version 6. Spamming

Project Report. Prepared for: Dr. Liwen Shih Prepared by: Joseph Hayes. April 17, 2008 Course Number: CSCI

Collaborative Filtering

Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine

Testing? Here s a Second Opinion. David Koconis, Ph.D. Senior Technical Advisor, ICSA Labs 01 October 2010

Introduction to Antispam Practices

CORE for Anti-Spam. - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow

Design and Implementation of a DMARC Verification Result Notification System

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

P2P Contents Distribution System with Routing and Trust Management

Collaborative Filtering. Doug Herbers Master s Oral Defense June 28, 2005

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

The Research on PGP Private Key Ring Cracking and Its Application

AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM

Anti-Spam Processing at UofH

Telemetry Data Sharing Using S/MIME

Cryptography and Network Security. Sixth Edition by William Stallings

Overview of the TREC 2005 Spam Track. Gordon V. Cormack Thomas R. Lynam. 18 November 2005

Account Customer Portal Manual

An Empirical Study of Behavioral Characteristics of Spammers: Findings and Implications

Mathematical Model of Network Address Translation Port Mapping

Deliverability Terms

Experience with SPM in IPv6

Extract of Summary and Key details of Symantec.cloud Health check Report

to Stay Out of the Spam Folder

Content Filters. Overview of Content Filters. How Content Filters Work. This chapter contains the following sections:

Advanced Spam Detection Methodology by the Neural Network Classifier

APCAUCE / APRICOT Kuala Lumpur Dave Crocker Brandenburg InternetWorking <

An advanced data leakage detection system analyzing relations between data leak activity

Based on material produced by among others: Sanjay Pol, Ashok Ramaswami, Jim Fenton and Eric Allman

Northeastern University in TREC 2009 Million Query Track

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES

Entering the China Market

Intrusion Detection and Violation of Compliance by Monitoring the Network

s and Anti-spam

Bayesian Spam Detection System Using Hybrid Feature Selection Method

TrendMicro Hosted Security. Best Practice Guide

Malware, , Database Security

TOWARD PRIVACY PRESERVING AND COLLUSION RESISTANCE IN A LOCATION PROOF UPDATING SYSTEM

Filtering Spam by Using Factors Hyperbolic Trees

DomainKeys Identified Mail Overview (-01) Eric Allman Sendmail, Inc.

Deploying a New Hash Algorithm. Presented By Archana Viswanath

Unknown Malicious Code Detection Based on Bayesian

Security with FailSafe

Table of content. Authentication Domain Subscribers Content Sending practices Conclusion...

An improved security model for identity authentication against cheque payment fraud in Tanzanian banks

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

A Content Vector Model for Text Classification

Authentication GUIDE. Frequently Asked QUES T ION S T OGETHER STRONGER

Defining Which Hosts Are Allowed to Connect Using the Host Access Table

Princess Nora Bint Abdulrahman University College of computer and information sciences Networks department Networks Security (NET 536)

Efficacious Spam Filtering and Detection in Social Networks

Probabilistic Anti-Spam Filtering with Dimensionality Reduction

Comparative analysis of classifiers for header based s classification using supervised learning

Technical Approaches to Spam and Standards Activities (ITU WSIS Spam Conference)

Internet Security Enhanced Security Services for S/MIME. Thomas Göttlicher

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Open Mic: IBM SmartCloud Notes Mail Hygiene. Robert Newell SmartCloud Notes Support July, 20 th 2016

A Secure Routing Protocol for Wireless Adhoc Network Creation

Introduction This paper will discuss the best practices for stopping the maximum amount of SPAM arriving in a user's inbox. It will outline simple

Version 5.2. SurfControl Filter for SMTP Administrator s Guide

Using Cross-Media Relations to Identify Important Communication Requests: Testing the Concept and Implementation

Discuss principles of effective e-marketing. Challenges & solutions for improved reader response

DEFENDING AGAINST MALICIOUS NODES USING AN SVM BASED REPUTATION SYSTEM

Web Mail and e-scout Instructions

Guide to Marketing

URL ATTACKS: Classification of URLs via Analysis and Learning

Teach Me How: B2B Deliverability in a B2C World

Anti-Spoofing. Inbound SPF Settings

Innovation IT Services Price List

An intelligent three-phase spam filtering method based on decision tree data mining

Filtering Spam Using Fuzzy Expert System 1 Hodeidah University, Faculty of computer science and engineering, Yemen 3, 4

INCORPORTING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR SPAMMING

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Deliverability Blasting... 2 White Listing... 2 How To Avoid Junk Mail Filters... 3 What Words To Avoid Frequency...

Optimization of your deliverability: set up & best practices. Jonathan Wuurman, ACTITO Evangelist

Bayesian Spam Detection

Internet Engineering Task Force (IETF) Category: Standards Track April 2011 ISSN:

Information Retrieval

Defining Which Hosts Are Allowed to Connect Using the Host Access Table

Decision Science Letters

COSC 301 Network Management. Lecture 14: Electronic Mail

Ontology-Based Web Query Classification for Research Paper Searching

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Robocall and fake caller-id detection

Getting into Gmail and other inboxes: A marketer's guide to the toughest spam filters

Transcription:

Available online at www.sciencedirect.com ScienceDirect AASRI Procedia 5 (2013 ) 220 227 2013 AASRI Conference on Parallel and Distributed Computing Systems A Reputation-based Collaborative Approach for Spam Filtering Wenxuan Shi a, Maoqiang Xie b a College of Software, Nankai University, Tianjin 300071, China b College of Software, Nankai University, Tianjin 300071, China Abstract Spam and spam filters are contrarious components of a complex interdependent social ecosystem. Traditional spam filtering techniques or systems are usually designed and deployed individually that neglect the distributed and bulk characteristics of spam. This paper proposes a reputation-based collaborative anti-spam approach. This approach, adopting fingerprinting technique, evaluating reporters trust, achieved better performance and robustness than the state-of-the-art in comparison experiments on several known email corpora. 2013 The Published Authors. by Published Elsevier by B.V. Elsevier Selection B.V. and/or peer review under responsibility of American Applied Selection Science Research and/or peer Institute review under responsibility of American Applied Science Research Institute Keywords: Spam filtering; reputation evaluation; collaborative; fingerprinting 1. Introduction There is frequent and close mutual relationship between people and data on web, and such relationship offers some lawbreakers potential opportunity to send all kinds of spam information including spam email, spam poster, spam invitation, spam advertisement, etc. According to characteristic of spam information, there are many definitions about spam in practice, such as unsolicited and unwanted email, indiscriminate bulk email sent directly or indirectly, unsolicited commercial email, etc. In addition to wasting recipients' time to deal with spam, spam also eats up a lot of network bandwidth. Moreover, spam occupies vast resources of computation and storage and it is profoundly annoying clients and email service providers (ESP). Consequently, many anti-spam techniques and systems, known as spam filters, have appeared associated with the appearance of spam. Such situation is called as spam ecosystem, that spam and spam filters are components of a complex interdependent system of social and technical structures. The anti-spam techniques 2212-6716 2013 The Authors. Published by Elsevier B.V. Selection and/or peer review under responsibility of American Applied Science Research Institute doi:10.1016/j.aasri.2013.10.082

Wenxuan Shi and Maoqiang Xie / AASRI Procedia 5 ( 2013 ) 220 227 221 can be categorized into two types: anti-spam techniques based on EK (expert knowledge) and anti-spam techniques based on ML (machine learning). Anti-spam techniques based on EK contain rule-based filtering, such as whitelists, blacklists, challenge-response, enhanced protocol, etc. There have been many spam filters based on EK, such as Sender Policy Framework (SPF) [1], Sender-ID [2], and Domain Keys Identified Mail (DKIM) [3], etc. Anti-spam techniques based on ML contain probability-based filtering, linear classifiers, Rocchio method, nearest neighbor method, logic-based method, data compression model, etc. Most of recent spam systems are individual spam filters which are designed and deployed individually according to various clients and ESPs. Considering the distributed and bulk characteristics of spam, the better schema of anti-spam system should be implemented depend on multi-client or multi-esp collaboration and has individuals sharing their judgments of legitimate email and spam. This paper proposes a reputation-based collaborative anti-spam approach, adopting fingerprinting technique, evaluating reporters trust, and it outperformed the state-of-the-art in comparison experiments on several known email corpora. 2. Related work In spam ecosystem, spam filters should pay close attention to two issues: one is how to protect recipients privacy and information safety, and another is how to utilize the distributed and bulk characteristics of spam. 2.1. Fingerprinting Fingerprint and fingerprint recognition are concepts originated from the area of biometric authenticate identity. After applying them in the area of information retrieval, fingerprint is defined as short tag for large object. Fingerprinting technique has the advantage that it can identify the same or similar duplicated documents with small partial variations by calculating and matching their hash values. In spam ecosystem, spam information has produced massive waste to network resource and person s time, such as mass data transmission on internet and mass storage on servers. On the other hand, spam information has produced serious damage to web information safety and personal information privacy. Fingerprinting schema is derived from traditional research domain of data encryption and digital signature. This schema adopts certain encryption hash arithmetic by which to generate shorter content digests for large original messages in order to take the place of original information storage and transmission. Fingerprint functions may be seen as high-performance hash functions and there are two kinds of known algorithms: Rabin's algorithm and cryptographic hash functions [4]. 2.2. Collaborative spam filtering Collaborative spam filtering is more efficient strategy to content filtering where rather than employing someone or certain computers to attract and analyze spam, and rather than having different user train himself individual filters. The whole collaborative community works together with shared spam knowledge. Therefore a collaborative spam filter needs certain shared and efficient database where storing different user s judgment about which is spam and which not. At present there has been some collaborative spam filters on web, such as DCC, Vipul's Razor, Pyzor, Cloudmark, etc[5, 6, 7]. These methods have similar strategy of using shared knowledge. The DCC (Distributed Checksum Clearinghouse) system detects spam by computing spam checksums and querying the same checksums in a database of checksums. Vipul's Razor system filters out known spam by maintaining a catalogue server of signatures feedback by spam receivers. Pyzor and Cloudmark have different protocols of spam filtering similar to Vipul's Razor.

222 Wenxuan Shi and Maoqiang Xie / AASRI Procedia 5 ( 2013 ) 220 227 3. Reputation Evaluation This paper proposes a collaborative approach for spam filtering based on reputation evaluation with weighing fingerprints and shared fingerprints database by means of which we can record, query, log, report and amend shared fingerprints, as shown in Figure 1. Fig. 1. Reputation evaluation with weighing fingerprints 3.1. Fingerprinting based on MIME-division Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email to sup-port a variety of format content, such as text in multiple character sets, non-text attachments, and message bodies with multiple parts, etc. However, the traditional spam filtering techniques commonly just consider plain text content of email but ignore MIME Features, and only considered the text content, as well as some released email corpora to public which only contain text content extracted from original email body. In our work, the anti-spam system handles spam as follows: 1) Divide each incoming email into five subparts: email header, text/plain content of email body, text/html content of email body, embedded resources and attachments. 2) Generate weighing fingerprint for different MIME subpart. 3) Compute an indicator score for each fingerprint set to indicate how spammy the fingerprint set is. 4) Calculate a compound weighted score based on individual indicator score. 5) Make a decision of spam or non-spam to the incoming email by comparing the compound weighted score with certain predefined threshold value. Define an incoming email as symbol, after the processing of MIME-division, the email is turned into a subparts set: with the weighting vector. Suppose the subpart can be expressed as a feature vector. Suppose the generated fingerprint set of is. The compound weighted score could be calculated as follows:

Wenxuan Shi and Maoqiang Xie / AASRI Procedia 5 ( 2013 ) 220 227 223 (1) where is the indicator score of fingerprint which can be trained, maintained and queried from certain fingerprints database. In the fingerprints database, we conserve every fingerprint information as a triplet where is a global unique hash value calculated by fingerprinting arithmetic, is the life cycle or generated fingerprint. 3.2. Reputation evaluation In our spam filtering system, we build a reputation evaluation method with weighing fingerprints and shared fingerprints database. By creating the shared fingerprints database, we can record new arrival fingerprints, query stored fingerprints, log users operations, report searching results and amend outdated fingerprints. To a reporter of the collaborative spam filtering system, we calculate the reporter s reputation value as follows: is current reputation value of reporter calculated by last reputation and current feedback result multiply by weighing factor, is reporter feedback result to certain email-m, which can be calculated as follows: (2) (3) is weighing factor for different kinds of feedback rolls in order to balance reporter feedback result: (4) where is pre-defined adjusting parameter for different feedback rolls defined by different source types of reporter. 3.3. Indicator score calculation To judge which is spam and which not on the basis of feedback, the collaborative anti-spam approach based on reputation calculates an indicator score for each generated fingerprint after MIME-division and weighting-assignment as follows:

224 Wenxuan Shi and Maoqiang Xie / AASRI Procedia 5 ( 2013 ) 220 227 (5) where, the indicator score of fingerprint, is the second element value of the triplet which is stored in fingerprints database. is the reputation value of reporter which can be configured and adjusted over time based on different reporter rolls defined as above( ). is the probabilistic value of feature calculated based on IDF (inverse document frequency). 3.4. Spam detection Spammers often control some puppet PCs and engaged servers to send mass spam. It is the purpose of spam filters to recognize these machines and senders. Spam filtering is the automatic processing to distinguish spam from non-spam between incoming emails according to specified criteria. After the steps of fingerprinting based on MIME-Division, and indicator score calculation based on reputation evaluation, we can generate a diverter for spam and non-spam as shown in Figure 2. Fig. 2. Diverter for spam and non-spam

Wenxuan Shi and Maoqiang Xie / AASRI Procedia 5 ( 2013 ) 220 227 225 In order to predict the label of an incoming email and then put into appropriate email folders, we can execute the judgment as follows: (6) where t is a pre-defined judgment threshold of legitimate email and spam. 4. Experiments In order to examine the performance of our approach (Collaborative Anti-Spam Scheme, CAS3 for short) proposed in this paper, we performed three groups of experiments based on three different corpora. 4.1. Experiment evaluation measure We performed three groups of experiments based on three different corpora (Table 1) where the Handwork corpus is collected by our previous accumulation which contains whole email content, such as attachments, embedded resources, etc. Table 1. Corpora used in comparison experiments Corpus Spam Non-Spam MIME Ling-Spam [8] 481 2412 Subject, Text/Plain Spam-Assassin [9] 1897 4150 Five Subparts Handwork 4059 831 Five Subparts In each group of experiments, we compare CAS3 with two known spam filters: SpamAssassin and Vipul's Razor. In comparison experiments, we draw certain Receiver Operating Characteristic curve to show different filtering effects between these methods. 4.2. Experiment evaluation results The first group of experiments is simulated based on Ling-Spam corpus and the comparison results are shown in Figure 3. Ling-Spam Corpus Misclassified Non-Spams (of 2412) Misclassified Spams (of 481) Fig. 3. ROC curve graph on Ling-Spam corpus

226 Wenxuan Shi and Maoqiang Xie / AASRI Procedia 5 ( 2013 ) 220 227 The second group of experiments is simulated based on Spam-Assassin corpus and the comparison results are shown in Figure 4. Spam-Assassin Corpus Misclassified Non-Spams (of 4150) Misclassified Spams (of 1897) Fig. 4. ROC curve graph on Spam-Assassin corpus The third group of experiments is simulated based on Handwork corpus and the comparison results are shown in Figure 5. Handwork Corpus Misclassified Non-Spams (of 831) Misclassified Spams (of 4059) Fig. 5. ROC curve graph on Handwork corpus 5. Conclusions In this paper, a reputation-based collaborative approach for spam filtering has been proposed that using the MIME features of email and adopts fingerprinting schema according to different subparts of email. Our approach achieved better performance and robustness than current popular filtering methods on several email corpora. Corresponding Author: Wenxuan Shi, shiwx@nankai.edu.cn, 086-13920561100 References [1] M. Wong and W. Schlitt. 2006. Sender Policy Framework (SPF) for Authorizing Use of Domains in E-

Wenxuan Shi and Maoqiang Xie / AASRI Procedia 5 ( 2013 ) 220 227 227 mail, Vol. RFC 4408. [2] J. Lyon and M. Wong. 2006. Sender-ID: Authenticating E-mail RFC 4406, Internet Engineering Task Force. [3] B. Lieba and J. Fenton. 2007. DomainKeys identified email (DKIM): Using digital signatures for domain verification. in CEAS 2007: The Third Conference on Email and Anti-Spam. [4] wikipedia.org. 2012. Fingerprint (computing). http://en.wikipedia.org/wiki/fingerprint_(computing). Andrew G. West, Avantika Agrawal, etc. 2011. Autonomous link spam detection in purely collaborative environments. In WikiSym 2011: The 7th International Symposium on Wikis and Open Collaboration. Network Security, pp. 15 17. [5] Wenxuan Shi, Maoqiang Xie, etc. 2011. Collaborative Spam Filtering Technique Based on MIME Fingerprints. In WCICA 2011: The 9th World Congress on Intelligent Control and Automation. Washington, DC, USA: IEEE Computer Society. 2011. [6] Wenxuan Shi, Maoqiang Xie, etc. 2011. Cooperative Anti-Spam System Based on Multilayer Agents. In WWW 2011: The 20th International World Wide Web Conference. NY, USA: ACM. 2011. 415~420. [7] Stason.org. 2006. Anti-SPAM Techniques: Collaborative Content Filtering. http://stason.org/articles/technology/email/junk-mail/collaborative_content_filtering.html. [8] I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, and C.D. Spyropoulos. 2000. An Evaluation of Naive Bayesian Anti-Spam Filtering. in Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9-17. [9] Spamassassin.org. 2003. The Spamassassin Public Mail Corpus. http://spamassassin.apache.org/publiccorpus.