Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Transcription:

Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008

Overview
- Introduction
- Countering Email Spam
  - Problem Description
  - Classification History
  - Ongoing Research
- Countering Web Spam
  - Problem Description
  - Classification History
  - Ongoing Research
- Conclusions

Introduction
- The Internet has spawned numerous information-rich environments
  - Email systems
  - World Wide Web
  - Social networking communities
- Openness facilitates information sharing, but it also makes these environments vulnerable

Denial of Information (DoI) Attacks
- Deliberate insertion of low-quality information (or noise) into information-rich environments
- Information analog to Denial of Service (DoS) attacks
- Two goals
  - Promotion of ideals by means of deception
  - Denial of access to high-quality information
- Spam is currently the most prominent example of a DoI attack


Countering Email Spam
- Close to 200 billion (yes, billion) emails are sent each day
- Spam accounts for around 90% of that email traffic
- ~2 million spam messages every second

Old Email Spam Examples

Problem Description
- Email spam detection can be modeled as a binary text classification problem
  - Two classes: spam and legitimate (non-spam)
- Example of supervised learning
  - Build a model (classifier) based on training data to approximate the target function
  - Construct a function φ: M → {spam, legitimate} that agrees with the target function Φ: M → {spam, legitimate} as much as possible

Problem Description (cont.)
- How do we represent a message?
- How do we generate features?
- How do we process features?
- How do we evaluate performance?

How do we represent a message?
- Classification algorithms require a consistent format
- Salton's vector space model ("bag of words") is the most popular representation
- Each message m is represented as a feature vector f of n features: <f_1, f_2, ..., f_n>
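As an illustration of the bag-of-words representation, here is a minimal sketch in Python. The vocabulary, the whitespace tokenization, and the function name `bag_of_words` are assumptions made for this example, not details from the lecture.

```python
from collections import Counter

def bag_of_words(message, vocabulary):
    """Map a message to a term-count vector <f_1, ..., f_n> over a fixed vocabulary."""
    counts = Counter(message.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical four-term vocabulary; a real filter would derive it from training data.
vocabulary = ["free", "meeting", "money", "now"]
vector = bag_of_words("Free money now claim your FREE prize", vocabulary)
print(vector)  # [2, 0, 1, 1]
```

Word order is discarded: only the per-term counts survive, which is what gives every message the same fixed-length format.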

How do we generate features?
- Sources of information
  - SMTP connections
    - Network properties
  - Email headers
    - Social networks
  - Email body
    - Textual parts
    - URLs
    - Attachments

How do we process features?
- Feature Tokenization
  - Alphanumeric tokens
  - N-grams
  - Phrases
- Feature Scrubbing
  - Stemming
  - Stop word removal
- Feature Selection
  - Simple feature removal
  - Information-theoretic algorithms
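The tokenization and scrubbing steps can be sketched as follows. This is a simplified illustration: the stop-word list is a tiny hypothetical one, the helper names are invented for the example, and stemming and feature selection are left out.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is"}

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stop_words(tokenize("Click the link to claim a FREE prize!"))
print(tokens)                  # ['click', 'link', 'claim', 'free', 'prize']
print(word_ngrams(tokens, 2))  # bigram features built from the scrubbed tokens
```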

How do we evaluate performance?
- Traditional IR metrics
  - Precision vs. Recall: $P = \frac{d}{b + d}$, $R = \frac{d}{c + d}$
- False positives vs. False negatives: $FP = \frac{b}{a + b}$, $FN = \frac{c}{c + d}$
  - Imbalanced error costs
- ROC curves
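The four ratios on this slide can be computed directly from confusion counts. The slide does not define a, b, c, and d; the sketch below assumes the reading that fits the formulas: a = legitimate messages classified as legitimate, b = legitimate classified as spam, c = spam classified as legitimate, d = spam classified as spam. The counts in the usage line are invented.

```python
def spam_metrics(a, b, c, d):
    """Compute the slide's ratios from confusion counts.

    Assumed meaning of the counts (not stated on the slide):
    a: legitimate -> legitimate, b: legitimate -> spam,
    c: spam -> legitimate,       d: spam -> spam.
    """
    return {
        "precision": d / (b + d),  # share of flagged messages that are spam
        "recall": d / (c + d),     # share of spam that was caught
        "fp_rate": b / (a + b),    # share of legitimate mail wrongly flagged
        "fn_rate": c / (c + d),    # share of spam that slipped through
    }

# Made-up counts for illustration only.
metrics = spam_metrics(a=90, b=10, c=20, d=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, recall and the false negative rate always sum to 1, since both are fractions of the same spam total c + d.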

Classification History
- Sahami et al. (1998)
  - Used a Naïve Bayes classifier
  - Were the first to apply text classification research to the spam problem
- Pantel and Lin (1998)
  - Also used a Naïve Bayes classifier
  - Found that Naïve Bayes outperforms RIPPER
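For readers unfamiliar with Naïve Bayes, here is a minimal sketch of a multinomial variant with add-one (Laplace) smoothing. It is not necessarily the formulation used in the papers cited above, and the four training messages are invented for illustration.

```python
import math
from collections import Counter

def train_nb(messages):
    """messages: list of (tokens, label) pairs. Returns (class counts, token counts, vocabulary)."""
    class_counts = Counter(label for _, label in messages)
    token_counts = {label: Counter() for label in class_counts}
    for tokens, label in messages:
        token_counts[label].update(tokens)
    vocab = {t for tokens, _ in messages for t in tokens}
    return class_counts, token_counts, vocab

def classify_nb(tokens, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(n / total)  # log prior
        for t in tokens:
            if t in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][t] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy training set.
training = [
    (["free", "money", "now"], "spam"),
    (["cheap", "pills", "free"], "spam"),
    (["meeting", "agenda", "monday"], "legitimate"),
    (["project", "meeting", "notes"], "legitimate"),
]
model = train_nb(training)
print(classify_nb(["free", "pills"], model))      # spam
print(classify_nb(["monday", "meeting"], model))  # legitimate
```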

Classification History (cont.)
- Drucker et al. (1999)
  - Evaluated Support Vector Machines as a solution to spam
  - Found that SVM is more effective than RIPPER and Rocchio
- Hidalgo and Lopez (2000)
  - Found that decision trees (C4.5) outperform Naïve Bayes and k-NN

Classification History (cont.)
- Up to this point, private corpora were used exclusively in email spam research
- Androutsopoulos et al. (2000a)
  - Created the first publicly available email spam corpus (Ling-spam)
  - Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier

Classification History (cont.)
- Androutsopoulos et al. (2000b)
  - Created another publicly available email spam corpus (PU1)
  - Confirmed previous research that Naïve Bayes outperforms a keyword-based filter
- Carreras and Marquez (2001)
  - Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes

Classification History (cont.)
- Androutsopoulos et al. (2004)
  - Created 3 more publicly available corpora (PU2, PU3, and PUA)
  - Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB
- Zhang et al. (2004)
  - Used Ling-spam, PU1, and the SpamAssassin corpora
  - Compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB

Classification History (cont.)
- CEAS (2004-present)
  - Focuses solely on email and anti-spam research
  - Generates a significant amount of academic and industry anti-spam research
- Klimt and Yang (2004)
  - Published the Enron Corpus, the first large-scale corpus of legitimate email messages
- TREC Spam Track (2005-present)
  - Produces new corpora every year
  - Provides a standardized platform to evaluate classification algorithms

Ongoing Research
- Concept Drift
- New Classification Approaches
- Adversarial Classification
- Image Spam

Concept Drift
- Spam content is extremely dynamic
  - Topic drift (e.g., specific scams)
  - Technique drift (e.g., obfuscations)
- How do we keep up with the Joneses?
  - Batch vs. Online Learning
[Figure: percentage of spam messages (0 to 0.4) by month, 01/03 through 01/06, for OBFUSCATING_COMMENT, INTERRUPTUS, HTML_FONT_LOW_CONTRAST, and HTML_TINY_FONT]

New Classification Approaches
- Filter Fusion
- Compression-based Filtering
- Network behavioral clustering

Adversarial Classification
- Classifiers assume a clear distinction between spam and legitimate features
- Camouflaged messages
  - Mask spam content with legitimate content
  - Disrupt decision boundaries for classifiers

Camouflage Attacks
- Baseline performance
  - Accuracies consistently higher than 98%
- Classifiers under attack
  - Accuracies degrade to between 50% and 70%
- Retrained classifiers
  - Accuracies climb back to between 91% and 99%
[Figures: weighted accuracy (λ = 9) versus number of retained features (10 to 640) for Naive Bayes, SVM, and LogitBoost]
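The transcript does not define the weighted accuracy plotted with λ = 9. One common definition, from Androutsopoulos et al., counts each legitimate message as λ messages, so that misclassifying legitimate mail costs λ times as much as missing spam. The sketch below assumes that definition is the one intended; the counts are invented.

```python
def weighted_accuracy(legit_correct, legit_total, spam_correct, spam_total, lam=9):
    """Weighted accuracy, assuming the Androutsopoulos et al. definition:

    WAcc = (lam * legit_correct + spam_correct) / (lam * legit_total + spam_total)
    """
    return (lam * legit_correct + spam_correct) / (lam * legit_total + spam_total)

# Made-up counts: 90 of 100 legitimate and 80 of 100 spam messages classified correctly.
print(weighted_accuracy(90, 100, 80, 100, lam=9))  # 0.89
print(weighted_accuracy(90, 100, 80, 100, lam=1))  # 0.85, plain accuracy
```

With λ = 1 the measure reduces to ordinary accuracy; larger λ makes errors on legitimate mail dominate the score.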

Camouflage Attacks (cont.)
- Retraining postpones the problem, but it doesn't solve it
- We can identify features that are less susceptible to attack, but that's simply another stalling technique
[Figure: fraction of false negatives versus round number (0 to 4; A denotes attack) for Naive Bayes, SVM, and LogitBoost]

Image Spam
- What happens when an email does not contain textual features?
- OCR is easily defeated
- Classification using image properties


Countering Web Spam
- What is web spam?
  - Traditional definition
  - Our definition
- Between 13.8% and 22.1% of all web pages are web spam

Ad Farms
- Only contain advertising links (usually ad listings)
- Elaborate entry pages used to deceive visitors

Ad Farms (cont.)
- Clicking on an entry page link leads to an ad listing
- Ad syndicators provide the content
- Web spammers create the HTML structures

Parked Domains
- Domain parking services
  - Provide placeholders for newly registered domains
  - Allow ad listings to be used as placeholders to monetize a domain
- Inevitably, web spammers abused these services

Parked Domains (cont.)
- Functionally equivalent to Ad Farms
  - Both rely on ad syndicators for content
  - Both provide little to no value to their visitors
- Unique characteristics
  - Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.)
  - Typically for sale by owner ("Offer To Buy This Domain")

Parked Domains (cont.)

Advertisements
- Pages advertising specific products or services
- Examples of the kinds of pages being advertised in Ad Farms and Parked Domains

Problem Description
- Web spam detection can also be modeled as a binary text classification problem
- Salton's vector space model is quite common
- Feature processing and performance evaluation are also quite similar
- But what about feature generation?

How do we generate features?
- Sources of information
  - HTTP connections
    - Hosting IP addresses
    - Session headers
  - HTML content
    - Textual properties
    - Structural properties
  - URL linkage structure
    - PageRank scores
    - Neighbor properties
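To make the link-based features more concrete, here is a minimal power-iteration sketch of PageRank. The damping factor of 0.85, the iteration count, the even redistribution of rank from pages without outlinks, and the four-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: pages a, b, and d all link to c; c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

A per-page score like this can then be used as one entry in the feature vector, alongside content-based features.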

Classification History
- Davison (2000)
  - Was the first to investigate link-based web spam
  - Built decision trees to successfully identify nepotistic links
- Becchetti et al. (2005)
  - Revisited the use of decision trees to identify link-based web spam
  - Used link-based features such as PageRank and TrustRank scores

Classification History
- Drost and Scheffer (2005)
  - Used Support Vector Machines to classify web spam pages
  - Relied on content-based features as well as link-based features
- Ntoulas et al. (2006)
  - Built decision trees to classify web spam
  - Used content-based features (e.g., fraction of visible content, compressibility, etc.)
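Two of the content-based features named above can be approximated in a few lines. This is a rough sketch under stated assumptions: visible content is taken to be text outside `script` and `style` elements, compressibility is taken to be uncompressed size divided by zlib-compressed size, and neither is guaranteed to match the exact definitions in Ntoulas et al.

```python
import zlib
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._hidden = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        if not self._hidden:
            self.parts.append(data)

def visible_fraction(html):
    """Length of visible text divided by total page length (assumed definition)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    visible = "".join(parser.parts).strip()
    return len(visible) / len(html) if html else 0.0

def compression_ratio(text):
    """Uncompressed size divided by compressed size; repetitive pages score high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw)) if raw else 0.0

print(visible_fraction("<p>hi</p>"))
print(compression_ratio("buy cheap pills " * 200))
```

Keyword-stuffed pages tend to repeat the same phrases, which is why a high compression ratio can serve as a spam signal.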

Classification History
- Up to this point, web spam research was limited to small (on the order of a few thousand pages), private data sets
- Webb et al. (2006)
  - Presented the Webb Spam Corpus, a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages)
  - http://www.webbspamcorpus.org
- Castillo et al. (2006)
  - Presented the WEBSPAM-UK2006 corpus, a publicly available web spam corpus (only contains 1,924 web spam pages)

Classification History
- Castillo et al. (2007)
  - Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set
  - Used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)]
- Webb et al. (2008)
  - Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively
  - Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set
  - Found that these classifiers are comparable to (and in many cases, better than) existing approaches

Ongoing Research
- Redirection
- Phishing
- Social Spam

Redirection
- 144,801 unique redirect chains (1.54 average HTTP redirects)
- 43.9% of web spam pages use some form of HTML or JavaScript redirection
[Figure: pie chart of redirect types: 302 HTTP redirect, 301 HTTP redirect, frame redirect, iframe redirect, meta refresh, meta refresh and location.replace(), meta refresh and location, location*, Other; slice shares range from 1% to 49%]
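The HTML and JavaScript redirection techniques named on this slide can be spotted with simple pattern matching. The sketch below is only a heuristic illustration: the regular expressions and the function name `redirect_techniques` are assumptions, and real pages would need proper HTML and JavaScript parsing to avoid false matches.

```python
import re

# Heuristic patterns (assumptions for illustration, not a complete detector).
META_REFRESH = re.compile(r'<meta[^>]+http-equiv\s*=\s*["\']?refresh', re.IGNORECASE)
JS_LOCATION = re.compile(r'\blocation(\.href)?\s*=|\blocation\.replace\s*\(', re.IGNORECASE)
FRAME = re.compile(r'<i?frame\b', re.IGNORECASE)

def redirect_techniques(html):
    """Return the names of the redirection techniques that appear in the page source."""
    found = []
    if META_REFRESH.search(html):
        found.append("meta refresh")
    if JS_LOCATION.search(html):
        found.append("location")
    if FRAME.search(html):
        found.append("frame")
    return found

page = '<meta http-equiv="refresh" content="0; url=http://example.com/">'
print(redirect_techniques(page))  # ['meta refresh']
```

HTTP-level redirects (301 and 302) do not appear in the page source at all; they would have to be read from the response status codes while following each redirect chain.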

Phishing
- Interesting form of deception that affects email and web users
- Another form of adversarial classification

Social Spam
- Comment spam
- Bulletin spam
- Message spam

Conclusions
- Email and web spam are currently two of the largest information security problems
- Classification techniques offer an effective way to filter this low-quality information
- Spammers are extremely dynamic, generating various areas of important future research

Questions