Collaborative Filtering. Doug Herbers Master s Oral Defense June 28, 2005

Similar documents
Collaborative Filtering

Dell Service Level Agreement for Microsoft Online Services

Mail Services SPAM Filtering

Getting Started 2 Logging into the system 2 Your Home Page 2. Manage your Account 3 Account Settings 3 Change your password 3

Introduction. Logging in. WebQuarantine User Guide

s & Listservs CAHNRS Office Hours October 19, 2016

Introduction. Logging in. WebMail User Guide

Monterey County Office of Education Quarantine (Spam ) System - User Guide

Spam UF. Use and customization instructions for the Barracuda Spam service at the University of Florida.

DONE FOR YOU SAMPLE INTERNET ACCEPTABLE USE POLICY

Collaborative Spam Mail Filtering Model Design

Acceptable Use Policy

Easy Activation Effortless web-based administration that can be activated in as little as one business day - no integration or migration necessary.

CORE for Anti-Spam. - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow

SPAM PRECAUTIONS: A SURVEY

New Tools and Tactics from Technology Service. Wayne Gilroy Ross Eaton Tom Janicki Ed Knudsen

The Dilemma: Junk, Spam, or Phishing? How to Classify Unwanted s and Respond Accordingly

Acceptable Use Policy

Creating Rules in Office 365

Ethical Hacking and. Version 6. Spamming

Panda Security. Protection. User s Manual. Protection. Version PM & Business Development Team

Acceptable Use Policy

Service Level Agreement for Microsoft Online Services

II.C.4. Policy: Southeastern Technical College Computer Use

SAYRE AREA SCHOOL DISTRICT TECHNOLOGY TIPS SPAM SASD PROOFPOINT BASICS ON HOW

Using Your New Webmail

Jacksonville State University Acceptable Use Policy 1. Overview 2. Purpose 3. Scope

Objectives CINS/F1-01

Increasing the Accuracy of a Spam-Detecting Artificial Immune System

Comodo Antispam Gateway Software Version 2.1

For example, if a message is both a virus and spam, the message is categorized as a virus as virus is higher in precedence than spam.

Acceptable Use Policy

Deliverability Terms

CPSC156a: The Internet Co-Evolution of Technology and Society

GFI product comparison: GFI MailEssentials vs. McAfee Security for Servers

Computer aided mail filtering using SVM

Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

TurnkeyMail 7.x Help. Logging in to TurnkeyMail

Open Mic: IBM SmartCloud Notes Mail Hygiene. Robert Newell SmartCloud Notes Support July, 20 th 2016

How does the Excalibur Technology SPAM & Virus Protection System work?

Creating Threat Prevention Connection Rules

University Information Technology (UIT) Proofpoint Frequently Asked Questions (FAQ)

Using BBC Raw

Electronic mail, or is a quick way of sending messages to people using the internet.

A Reputation-based Collaborative Approach for Spam Filtering

FortiGuard Antispam. Frequently Asked Questions. High Performance Multi-Threat Security Solutions

How to Ensure You Can Access National CASA Member News

Pending U.S. Anti-spam Legislation: A Marketer's Guide

Barracuda Spam Firewall User s Guide

Comodo Antispam Gateway Software Version 2.11

Enterprise SM VOLUME 1, SECTION 5.7: SECURE MANAGED SERVICE

Managing Spam. To access the spam settings in admin panel: 1. Login to the admin panel by entering valid login credentials.

Spam Filtering Solution For The Western Interstate Commission For Higher Education (Wiche)

Spam, Security and SORBS v2.0

Eftel s Anti-Spam Manual

Introduction to Antispam Practices

Cleveland State University General Policy for University Information and Technology Resources

Barracuda Security Gateway User 's Guide 6 and Above

Keystone Acceptable Use Policy

Advanced Settings. Help Documentation

Deployment Guides. Help Documentation

2. Design Methodology

Not Your Mother's Marketing

Dataprise Managed Anti-Spam Console

WITH INTEGRITY

Proofpoint Anti-Spam Software For John Jay College

What's new in Europa?

1 Abstract Data Types (11 Points)

New User Registration:

User Services OBJECTIVES

Acceptable Use and Publishing Policy

Getting into Gmail and other inboxes: A marketer's guide to the toughest spam filters

Fighting Spam, Phishing and Malware With Recurrent Pattern Detection

, Rules & Regulations

2 User Guide. Contents

Comodo Comodo Dome Antispam MSP Software Version 2.12

Comodo Antispam Gateway Software Version 2.12

MX Control Console. Administrative User Manual

Guest Wireless Policy

Deliverability: The Battle to the Inbox

MailChimp Basics. A step by step guide to MailChimp Course developed by Virginia Ridley

An Overview of Webmail

NSDA ANTI-SPAM POLICY

GFI product comparison: GFI MailEssentials vs. Barracuda Spam Firewall

Internet Level Spam Detection and SpamAssassin Matt Sergeant Senior Anti-Spam Technologist MessageLabs

Choose a starting point:

SPAM QUARANTINE. Security Service. Information Technology Services

Security+ Guide to Network Security Fundamentals, Fourth Edition. Chapter 5 Host, Application, and Data Security

Symantec Protection Suite Add-On for Hosted Security

TOTAL CONTROL SECURITY END USER GUIDE

IBM Managed Security Services for Security

MailMarshal SMTP Anti-Spam Configuration

Terms and Conditions

GFI product comparison: GFI MailEssentials vs. Trend Micro ScanMail Suite for Microsoft Exchange

Security Gap Analysis: Aggregrated Results

to Stay Out of the Spam Folder

Content Based Spam Filtering

New User Registration:

Barracuda Security Service User Guide

Transcription:

Collaborative E-Mail Filtering Doug Herbers Master s Oral Defense June 28, 2005

Background Spamming the use of any electronic communications medium to send unsolicited messages in bulk E-Mail is the most common medium, also cell phones, text messaging, and pagers SPAM has developed a negative reputation SPAM ~ Door-to-Door Sales ~ Junk Postal Mail

E-Mail & SPAM Trends On average, 31 billion e-mails were sent each day in 2002 MSNBC reports 66% of World s E-Mail is SPAM (May 2004) MessageLabs reports 76% of e-mail received by their clients in May 2004 was SPAM, projected 81% by February 2005

SPAM Trends MessageLabs filtering results of E-Mail Worldwide

Legislation & Litigation Make Short-Term Decreases in SPAM CAN-SPAM (Controlling the Assault of Non- Solicited Pornography and Marketing Act), December 2003 Message must have valid headers Subject must represent content of e-mail Message must include a valid postal address of the sender Message must include an unsubscribe notification, by which e-mails will cease in less than 10 days after submission

Why is SPAM Harmful? Decreased productivity for employees according to estimates, a company with 200 employees will waste about 5000 minutes per month, up to $3,000 per month dealing with SPAM Exposing children to inappropriate material Congestion of Internet Service Provider s Networks costs are passed on to consumers Scams and devious behavior, virus, denial of service attacks, etc.

Why Do We Need a Filter? Decrease the quantity of SPAM messages in our inboxes Protect minors from inappropriate content Protect from scams Protect from viruses

Previous Filtering Techniques Rule-Based Systems Rules that have to be manually changed to adapt to new SPAM Statistical (TF-IDF) Statistical based on word frequencies *Naïve Bayes Probabilistic, trained on e-mails, will adapt and get better over time, false positive rates less than 1 in 1000. Memory-Based Filtering Vector-based, judgments made by considering knn related e- mail message vectors

Previous Filtering Techniques Blacklists and Whitelists allow or deny specific users to sent you e-mail, usually requires some form of handshaking Collaborative Filtering local vs worldwide model

E-Mail Harvesting

Goals Use multiple user s e-mail to identify and remove SPAM Apply algorithms at the user-level and system-level, and compare results Show an improvement over SpamAssassin alone

Approach Preliminary Filter SpamAssassin Identify duplicated messages Remove all duplicate messages Evaluate various definitions of duplicate

Data Collection Collection of two weeks of e-mail Week 1 Test Data Set Week 2 Validation Data Set

User Set Sixteen Volunteers from ITTC 2 Professors 3 Ph.D. Students 7 M.S. Students 2 B.S. Students 2 Staff Members

E-Mail Classification Classification of e-mail determined by what the user did with each message Location Read/Unread Classification Inbox Read Legitimate Inbox Unread Void SPAM Folder Read or Unread SPAM Trash Read Legitimate Trash Unread SPAM

User Total Messages Week 1 - Test Data Set Legitimate Messages % SPAM % SpamAssassi n 1 818 70 9 % 747 91 % 675 90 % 1 2 17 7 41 % 1 6 % 0 0 % 9 3 51 45 88 % 6 12 % 0 0 % 0 % Void Messages 4 922 236 26 % 676 73 % 641 95 % 10 5 434 105 24 % 324 75 % 292 90 % 5 6 11 11 100 % 0 0 % 0 0 % 0 7 8 0 0 % 7 88 % 0 0 % 1 8 8 7 88 % 0 0 % 0 0 % 1 9 54 12 22 % 19 35 % 16 84 % 23 10 305 32 10 % 252 83 % 194 77 % 21 11 3 3 100 % 0 0 % 0 0 % 0 12 8 2 25 % 0 0 % 0 0 % 6 13 106 5 5 % 89 84 % 8 9 % 12 14 1516 228 13 % 1208 85 % 1088 85 % 2 15 0 0 0 % 0 0 % 0 0 % 0 16 48 43 90 % 1 2 % 0 0 % 4 Totals 4309 806 19 % 3408 79 % 2914 86 % 95

User Selection > 20 Messages Total > 1 SPAM Message > 1 Legitimate Message Nine Users Remain

Determination of Baseline Remove all void messages (95) Remove intra-server e-mail (514) Remove all messages tagged as SPAM by SpamAssassin (2883) Legitimate According to User SPAM According to User Tagged as Legitimate 338 0 Tagged as SPAM 441 2883

Revised Data Set User Total Messages Legitimate % SPAM % 1 124 52 42 % 72 58 % 3 5 4 80 % 1 20 % 4 115 82 71 % 33 29 % 5 90 58 64 % 32 26 % 9 3 0 0 % 3 100 % 10 59 16 27 % 43 73 % 13 82 4 5 % 78 95 % 14 272 93 34 % 179 66 % 16 29 29 100 % 0 0 % Totals 779 338 43 % 441 57 %

Evaluation Criteria Tagged as Legitimate Tagged as SPAM Legitimate According to User Legitimate Passed Legitimate Removed (False Positive) SPAM According to User SPAM Passed (False Negative) SPAM Removed Recall = Legitimate Passed Legitimate Passed + Legitimate Removed Precision = Legitimate Passed Legitimate Passed + SPAM Passed Accuracy = Legitimate Passed + SPAM Removed All Messages in the Data Set

Evaluation Criteria (cont.) FMeasure = 2 2 (1 + β ) Precision Recall β Precision + Recall Chose Beta=2.0 to weight recall higher than precision

User-Level Remove all duplicates within a user s e-mail box User 1 User 2 User 3 Msg 1 Msg 2 Msg 3 Msg 2 Msg 2 Msg 4 Msg 5 Msg 6 Msg 6 Message 2 counts as one message with two duplicates

System-Level Remove all duplicates over all e-mail boxes User 1 User 2 User 3 Msg 1 Msg 2 Msg 3 Msg 2 Msg 2 Msg 4 Msg 5 Msg 6 Msg 6 Message 2 counts as one message with three duplicates Message 6 counts as one message with two duplicates

Classification of Msg 2 User 2 User 3 Number of Copies Classify as Legitimate 2 1 1 1 0 1 Classify as SPAM User-Level 1 legitimate removed & 1 SPAM removed System-Level 1 legitimate removed & 2 SPAM removed

Qualities of Messages Algorithm 1: Subject, User-Level Algorithm 2: Subject, System-Level Algorithm 3: Sender, User-Level Algorithm 4: Sender, System-Level Algorithm 5: Body, User-Level Algorithm 6: Body, System-Level

Algorithm 3 User-Level Sender Duplicates

All Messages not sent from *ku.edu (or *ukans.edu) after Baseline 350 300 250 Instances 200 150 100 50 0 1 2 3 4 5 6 7 8 Legitimate Messages 135 38 28 5 5 0 1 1 SPAM Messages 322 37 7 2 3 2 1 1 Copies (sender, user-level)

Precision, Recall and F-Measure 1.20 1.00 0.80 0.60 0.40 0.20 0.00 2 3 4 5 6 7 8 Recall 0.399 0.595 0.837 0.896 0.964 0.964 0.982 Precision 0.295 0.342 0.411 0.423 0.434 0.427 0.431 F-Measure 0.373 0.518 0.693 0.732 0.775 0.770 0.782 Copies (sender, user-level)

Accuracy of Algorithm 3 and SpamAssassin 1.00 Instances 0.90 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 2 3 4 5 6 7 8 Accuracy 0.857 0.857 0.874 0.877 0.880 0.877 0.878 Copies (subject, system-level)

Algorithm 6 System-Level Body Duplicates

All Messages not sent from *ku.edu (or *ukans.edu) after Baseline 350 300 250 Instances 200 150 100 50 0 1 2 3 4 5 Legitimate Messages 276 27 4 0 0 SPAM Messages 350 37 5 1 1 Copies (body, system-level)

Precision, Recall and F-Measure 1.20 1.00 0.80 0.60 0.40 0.20 0.00 2 3 4 5 Recall 0.817 0.973 1.000 1.000 Precision 0.441 0.438 0.439 0.437 F-Measure 0.698 0.782 0.796 0.795 Copies (body, system-level)

Accuracy of Algorithm 6 and SpamAssassin 1.00 Instances 0.90 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 2 3 4 5 Accuracy 0.887 0.882 0.882 0.881 Copies (body, system-level)

F-Measure of Respective Algorithms 1.00 0.90 0.80 F-Measure 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 1:Subject, User-Level 2:Subject, System-Level 3:Sender, User-Level 4:Sender, System-Level 5:Body, User-Level 6:Body, System-Level SpamAssassin 2 3 4 5 6 7 8 9 10 11 12 Copies of Message

0.90 0.89 Accuracy of Respective Algorithms 1:Subject, User-Level 2:Subject, System-Level 3:Sender, User-Level 4:Sender, System-Level 5:Body, User-Level 6:Body, System-Level SpamAssassin Accuracy 0.88 0.87 0.86 0.85 2 3 4 5 6 7 8 9 10 11 12 Copies of Message

Validation Chose Best User-Level Algorithm: Algorithm 3, Duplicates Based on Sender Chose Best System-Level Algorithm: Algorithm 6, Duplicates Based on Body

Precision, Recall and F-Measure 1.20 1.00 0.80 0.60 0.40 0.20 0.00 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Recall 0.399 0.595 0.837 0.896 0.964 0.964 0.982 Precision 0.295 0.342 0.411 0.423 0.434 0.427 0.431 F-Measure 0.373 0.518 0.693 0.732 0.775 0.770 0.782 Recall' 0.378 0.628 0.691 0.788 0.868 0.917 0.917 0.924 0.938 0.938 0.938 0.979 0.979 0.979 Precision' 0.242 0.317 0.320 0.337 0.353 0.357 0.357 0.356 0.357 0.357 0.357 0.367 0.367 0.367 F-Measure' 0.340 0.525 0.561 0.622 0.672 0.698 0.698 0.700 0.708 0.708 0.708 0.734 0.734 0.734 Copies (sender, user-level)

Precision, Recall and F-Measure 1.20 1.00 0.80 0.60 0.40 0.20 0.00 2 3 4 5 6 Recall 0.817 0.973 1.000 1.000 Precision 0.441 0.438 0.439 0.437 F-Measure 0.698 0.782 0.796 0.795 Recall' 0.760 0.983 0.997 0.997 0.997 Precision' 0.375 0.377 0.369 0.369 0.369 F-Measure' 0.631 0.744 0.744 0.744 0.744 Copies (body, system-level)

Conclusions Probability of a message to be SPAM increases as the number of copies of the messages increases Since all algorithms improve with more duplicates, a larger collection of participating users in our study would likely have shown more convincing improvements

Conclusions Only dealing with duplicate messages limited the overall effectiveness One average, the algorithms performed 90% or better as compared to the maximum achievable F-measure SPAM is subjective, and a more personalized filter might be a better solution

Future Work Try the algorithms on a larger community Learning Collaborative Filter create a long-term database of e-mails Collaborative Voting Filter allow users to classify e-mail via mail reader