USING CLASSIFIER CASCADES FOR SCALABLE E-MAIL CLASSIFICATION Jay Pujara Hal Daumé III Lise Getoor jay@cs.umd.edu me@hal3.name getoor@cs.umd.edu 9/1/2011

Building a scalable e-mail system 1 Goal: maintain system throughput across conditions. Varying conditions: load varies, resource availability varies, the task varies. Challenge: build a system that can adapt its operation to the conditions at hand.

Problem structure informs scalable solution 2 [Diagram: feature structure ordered by cost, from $ (IP) through Mail From and Subject to $$$ (Body); class structure ordered by granularity, from coarse (Ham, Spam) to fine (Business, Personal, Newsgroup, Social Network)]

Important facets of problem 3 Structure in input: features may have an order or systemic dependency, and acquisition costs vary from cheap to expensive. Structure in output: labels naturally have a coarse-to-fine hierarchy, and different levels of the hierarchy have different sensitivities to cost. Exploit this structure during classification: minimize costs, minimize error.

Two overarching questions 4 When should we acquire features to classify a message? How does this acquisition policy change across different classification tasks? Classifier cascades can answer both questions!

Introducing Classifier Cascades 5-8 A cascade is a series of classifiers f1, f2, f3 ... fn. Each classifier fi operates on a different, increasingly expensive set of features (φ1), (φ1, φ2), (φ1, φ2, φ3), ... with costs c1, c2, c3 ... cn. Each classifier outputs a value in [-1, 1], the margin or confidence of its decision. The γ parameters control the relationship between the classifiers: when a classifier's margin falls below its threshold (fi < γi), the message is passed to the next, more expensive classifier, as in the sketch below.
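A minimal sketch of cascade inference, assuming per-stage feature extractors and scoring functions that return a margin in [-1, 1]; the names here are illustrative, not from the original slides:

```python
def classify_cascade(message, stages, gammas):
    """stages: list of (extract, score, cost) tuples, cheapest stage first.
    gammas: one confidence threshold per non-final stage; a margin below
    a stage's gamma sends the message deeper into the cascade."""
    features = {}
    total_cost = 0.0
    margin = 0.0
    for (extract, score, cost), gamma in zip(stages, list(gammas) + [None]):
        features.update(extract(message))  # acquire this stage's feature set
        total_cost += cost                 # pay its acquisition cost
        margin = score(features)           # margin/confidence in [-1, 1]
        if gamma is None or margin >= gamma:
            break                          # confident enough, or final stage
    label = 1 if margin >= 0 else -1
    return label, total_cost
```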

Optimizing Classifier Cascades 9 Loss function L(y, F(x)) captures errors in classification; minimize it while incorporating cost. Cost-constrained with a budget (load-sensitive): min_γ Σ_{(x,y)∈D} L(y, F(x)) s.t. C(x) < B. Cost-sensitive loss function (granular): min_γ Σ_{(x,y)∈D} [L(y, F(x)) + λ C(x)]. Use grid search to find the optimal γ parameters, as sketched below.
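The grid search can be sketched as an exhaustive sweep over threshold combinations, keeping the lowest-error setting whose average feature cost stays under the budget B. This is a hypothetical helper built on the classify_cascade sketch above:

```python
import itertools

def grid_search_gammas(dev_set, stages, budget, grid):
    """dev_set: list of (message, label) pairs; grid: candidate gamma values."""
    best = None  # (error rate, average cost, gammas)
    for gammas in itertools.product(grid, repeat=len(stages) - 1):
        errors, cost = 0, 0.0
        for message, label in dev_set:
            pred, c = classify_cascade(message, stages, gammas)
            errors += int(pred != label)   # 0/1 loss L(y, F(x))
            cost += c
        err_rate, avg_cost = errors / len(dev_set), cost / len(dev_set)
        # keep the most accurate setting that satisfies C(x) < B
        if avg_cost < budget and (best is None or err_rate < best[0]):
            best = (err_rate, avg_cost, gammas)
    return best
```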

12 Load-Sensitive Classification

Features have costs & dependencies 13 [Diagram: features ordered by cost, $ (IP) to $$$ (Body), annotated with network packets and cache size] The IP is known at socket-connect time and is 4 bytes in size.

Features have costs & dependencies 14 The Mail From is one of the first commands of an SMTP conversation. From addresses have a known format, but higher diversity.

Features have costs & dependencies 15 The subject, one of the mail headers, arrives only after a number of network exchanges. Since the subject is user-generated, it is very diverse and often lacks a defined format.

Load-Sensitive Problem Setting 16 [Cascade: IP Classifier → (f1 < γ1) → MailFrom Classifier → (f2 < γ2) → Subject Classifier] Train the IP, MailFrom, and Subject classifiers; for a given budget B, choose γ1, γ2 that minimize error within B. Constraint: C(x) < B.

Load-Sensitive Challenges 17 Choosing γ1, γ2 can overfit the model: train-time costs underestimate test-time costs. Use a regularization constant Δ, sensitive to the cost variance (σ), to account for this variability. Revised constraint: C(x) + Δσ < B. (See the sketch below.)
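A small sketch of the revised feasibility check, under the assumption that the Δσ term adds Δ times the standard deviation of observed per-message costs to the budget test; the helper name is hypothetical:

```python
import statistics

def within_budget(costs, budget, delta):
    """costs: per-message feature costs measured on held-out data."""
    mean_cost = statistics.mean(costs)
    sigma = statistics.pstdev(costs)           # variability of acquisition cost
    return mean_cost + delta * sigma < budget  # C(x) + Δσ < B
```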

18 Granular Classification

E-mail Challenges: Spam Detection 19 (Ham vs. Spam) Most mail is spam; with billions of classifications, detection must be incredibly fast.

E-mail Challenges: Categorizing Mail 20 (Business, Personal, Newsgroup, Social Network) Computationally intensive processing; each task applies to one class. E-mail systems do more than filter spam, with tasks such as: extracting receipts and tracking info, threading conversations, filtering into mailing lists, and inlining social network responses.

Coarse task is constrained by feature cost 21 [Feature/class structure diagram as before, with the tradeoff λc attached to the coarse Ham/Spam classes]

Fine task is constrained by misclassification cost 22 [Feature/class structure diagram as before, with the tradeoff λf attached to the fine Business/Personal/Newsgroup/Social Network classes]

Granular Classification Problem Setting 23 Two separate models for the two tasks, each an IP → MailFrom → Subject cascade with its own classifiers and cascade parameters: the coarse (Ham vs. Spam) cascade minimizes L(y, h(x)) + λc C(x), and the fine (Business, Personal, Newsgroup, Social Network) cascade minimizes L(y, h(x)) + λf C(x). Choose γ1, γ2 for each cascade to balance accuracy and cost under its tradeoff λ.
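A sketch of threshold selection in the granular setting: each task's cascade minimizes the cost-sensitive objective L(y, F(x)) + λ·C(x) with its own λ, reusing the classify_cascade sketch above. The helper name and the 0/1 loss are illustrative assumptions:

```python
import itertools

def pick_gammas(dev_set, stages, lam, grid):
    """Return the thresholds minimizing empirical loss + lam * feature cost."""
    best_obj, best_gammas = float("inf"), None
    for gammas in itertools.product(grid, repeat=len(stages) - 1):
        loss, cost = 0.0, 0.0
        for message, label in dev_set:
            pred, c = classify_cascade(message, stages, gammas)
            loss += int(pred != label)   # L(y, F(x)), 0/1 loss
            cost += c
        obj = (loss + lam * cost) / len(dev_set)
        if obj < best_obj:
            best_obj, best_gammas = obj, gammas
    return best_gammas

# Per the slides, the coarse spam task uses a larger tradeoff (e.g. lambda_c = 1.5)
# than the fine task (e.g. lambda_f = 1), so the high-volume spam cascade stays
# cheap while the fine categorization cascade buys more features for accuracy.
```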

27 Experimental Results

Experimental Setup: Overview 28 Two tasks: load-sensitive & granular classification. Two datasets: the Yahoo! Mail corpus and TREC-2007; load-sensitive uses both datasets, granular uses only Yahoo!. Results are 10-fold CV, with bold values significant (p < .05). Cascade stages use the MEGAM MaxEnt classifier, as in the stand-in sketch below.
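The slides train each cascade stage with Hal Daumé's MEGAM MaxEnt tool; as a stand-in, this sketch uses scikit-learn's logistic regression (the binary MaxEnt model) to fit one model per stage on cumulatively larger feature sets. scikit-learn, and every name below, is an assumption for illustration, not the original tooling:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_stages(train_set, extractors):
    """train_set: list of (message, label); extractors: per-stage feature
    functions, cheapest first (e.g. IP, MailFrom, Subject)."""
    models = []
    for depth in range(1, len(extractors) + 1):
        X = []
        for message, _ in train_set:
            feats = {}
            for extract in extractors[:depth]:  # cumulative feature sets
                feats.update(extract(message))
            X.append(feats)
        y = [label for _, label in train_set]
        model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X, y)                         # one MaxEnt model per stage
        models.append(model)
    return models
```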

Experimental Setup: Yahoo! Data 29 Data from 1227 Yahoo! Mail messages from 8/2010; feature costs calculated from network + storage cost.

Class            Messages        Feature     Cost
Spam             531             IP          .168
Business         187             MailFrom    .322
Social Network   223             Subject     .510
Newsletter       174
Personal/Other   102

Experimental Setup: TREC data 30 Data from the TREC-2007 Public Spam Corpus, 47194 messages; uses the same feature cost estimates.

Class   Messages
Spam    39055
Ham     8139

Results: Load-Sensitive Classification 32 Regularization prevents cost excesses (average excess cost over budget):

Δ      Y!Mail   TREC
0      .115     .059
.25    .020     0.00

Results: Load-Sensitive Classification 33 Significant error reduction. [Bar chart: classification error L(x), scale 0 to 0.14, comparing Naive against ACC with Δ = 0, Δ = .25, and Δ = .5 on the Yahoo! Mail and TREC-2007 datasets]

Results: Granular Classification 35 Compare fixed feature-acquisition policies against adaptive classifier cascades (ACC): significant gains in performance or cost (or both), depending on the tradeoff.

                                           Misclass Cost
Feature Set                  Feature Cost  Coarse  Fine  Overall
Fixed: IP                    .168          .139    .181  .229
ACC: λc=1.5, λf=1            .187          .140    .156  .217
Fixed: IP+MailFrom           .490          .128    .142  .200
ACC: λc=.1, λf=.075          .431          .111    .100  .163
Fixed: IP+MailFrom+Subject   1.00          .106    .108  .162
ACC: λc=.02, λf=.02          .691          .108    .105  .162

36 Dynamics of choosing λc and λf

37 Different approaches, same tradeoff

Conclusion 38 We address the problem of scalable e-mail classification and introduce two settings: load-sensitive classification (known budget) and granular classification (task sensitivity). Classifier cascades achieve a tradeoff between cost and accuracy, with results superior to the baselines. Questions? Research funded by the Yahoo! Faculty Research Engagement Program.