Logistic Regression: Probabilistic Interpretation


Approximate 0/1 Loss Solution: Approximate the 0/1 loss with a convex loss (a "surrogate" loss), plotted as a function of z = y · w⊤x. [Figure: the 0/1 loss together with the surrogate losses used by the SVM (hinge), logistic regression (logistic), and AdaBoost (exponential).]

Probabilistic Interpretation Goal: Model the conditional probability P[ y = 1 | x ]. Example: Predict rain from temperature, cloudiness, and humidity: P[ y = rain | t = 14°F, c = LOW, h = 2% ] = 0.05; P[ y = rain | t = 70°F, c = HIGH, h = 95% ] = 0.9. Example: Predict click from an ad's historical performance, the user's click frequency, and the publisher page's relevance: P[ y = click | h = GOOD, f = HIGH, r = HIGH ] = 0.1; P[ y = click | h = BAD, f = LOW, r = LOW ] = 0.001.

Probabilistic Interpretation Goal: Model the conditional probability P[ y = 1 | x ]. First thought: P[ y = 1 | x ] = w⊤x. But linear regression returns any real number, while probabilities range from 0 to 1! How can we transform or squash its output? Use the logistic (or sigmoid) function: P[ y = 1 | x ] = σ(w⊤x).

Logistic Function σ(z) = 1 / (1 + exp(−z)). It maps real numbers to [0, 1]: large positive inputs map near 1, large negative inputs map near 0. [Figure: plot of σ(z) versus z.]
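
As a quick illustration, here is a minimal NumPy sketch of the logistic function and its squashing behavior (the function name is ours, not from the lecture code):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Large positive inputs approach 1; large negative inputs approach 0
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# -> approximately [4.5e-05, 0.269, 0.5, 0.731, 0.99995]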

Probabilistic Interpretation Goal: Model the conditional probability P[ y = 1 | x ]. Logistic regression uses the logistic function to model this conditional probability: P[ y = 1 | x ] = σ(w⊤x) and P[ y = 0 | x ] = 1 − σ(w⊤x). For notational convenience we now define y ∈ {0, 1}.

How Do We Use Probabilities? To make class predictions, we need to convert probabilities to values in {0, 1}. We can do this by setting a threshold on the probabilities; the default threshold is 0.5: P[ y = 1 | x ] > 0.5 ⟹ ŷ = 1. Example: Predict rain from temperature, cloudiness, and humidity: P[ y = rain | t = 14°F, c = LOW, h = 2% ] = 0.05 ⟹ ŷ = 0; P[ y = rain | t = 70°F, c = HIGH, h = 95% ] = 0.9 ⟹ ŷ = 1.

Connection with Decision Boundary? Threshold by sign to make class predictions: ŷ = sign(w⊤x), i.e., ŷ = 1 when w⊤x > 0 and ŷ = 0 when w⊤x < 0; the decision boundary is w⊤x = 0. [Figure: a linear decision boundary in the (x1, x2) plane.] How does this compare with thresholding the probability? P[ y = 1 | x ] = σ(w⊤x) > 0.5 ⟹ ŷ = 1.

Connection with Decision Boundary? Since σ(0) = 0.5, we have σ(w⊤x) = 0.5 exactly when w⊤x = Σᵢ wᵢxᵢ = 0. [Figure: σ(z) with σ(0) = 0.5 marked at z = w⊤x = 0.] Threshold by sign to make class predictions: ŷ = sign(w⊤x), with ŷ = 1 when w⊤x > 0, ŷ = 0 when w⊤x < 0, and decision boundary w⊤x = 0. How does this compare with thresholding the probability? P[ y = 1 | x ] = σ(w⊤x) > 0.5 ⟹ ŷ = 1. With a threshold of 0.5, the decision boundaries are identical!

Using Probabilistic Predictions

Setting different thresholds In a spam detection application, we model P[ y = spam | x ]. There are two types of error: classifying a not-spam email as spam (false positive, FP) and classifying a spam email as not-spam (false negative, FN). One can argue that false positives are more harmful than false negatives: it is worse to miss an important email than to have to delete spam. We can therefore use a threshold greater than 0.5 to be more conservative about predicting spam.

ROC Plots: Measuring Varying Thresholds An ROC plot displays FPR vs. TPR. FPR: % of not-spam predicted as spam; TPR: % of spam predicted as spam. The top-left corner is perfect classification; the dotted diagonal line is random prediction (i.e., biased coin flips). [Figure: ROC curve with TPR on the y-axis and FPR on the x-axis, with thresholds ranging from T = 0 to T = 1 and the operating point at FPR = 0.1 highlighted.] We can classify at various thresholds T. T = 0: everything is spam, so TPR = 1 but FPR = 1. T = 1: nothing is spam, so FPR = 0 but TPR = 0. Varying the threshold lets us trade off between TPR and FPR.
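
For intuition, here is a rough NumPy sketch of how the points on such a curve can be computed by sweeping the threshold; the function and variable names are illustrative, not taken from the lab:

import numpy as np

def roc_points(probs, labels, thresholds):
    # probs: predicted P[y = 1 | x]; labels: true 0/1 labels
    points = []
    for t in thresholds:
        preds = (probs > t).astype(int)
        tp = np.sum((preds == 1) & (labels == 1))
        fp = np.sum((preds == 1) & (labels == 0))
        fn = np.sum((preds == 0) & (labels == 1))
        tn = np.sum((preds == 0) & (labels == 0))
        tpr = tp / float(tp + fn)   # % of spam predicted as spam
        fpr = fp / float(fp + tn)   # % of not-spam predicted as spam
        points.append((fpr, tpr))
    return points

probs = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, 0, 0])
print(roc_points(probs, labels, thresholds=[0.0, 0.5, 1.0]))
# threshold 0.0 -> (FPR 1.0, TPR 1.0); 0.5 -> (0.0, 1.0); 1.0 -> (0.0, 0.0)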

Working Directly with Probabilities Example: Predict click from an ad's historical performance, the user's click frequency, and the publisher page's relevance: P[ y = click | h = GOOD, f = HIGH, r = HIGH ] = 0.1 ⟹ ŷ = 0; P[ y = click | h = BAD, f = LOW, r = LOW ] = 0.001 ⟹ ŷ = 0. Success rates can be less than 1% [Andrew Stern, iMedia Connection, 2010]. Probabilities provide more granular information, namely the confidence of a prediction, which is useful when combining predictions with other information. In such cases, we want to evaluate the probabilities directly, and logistic loss makes sense for evaluation!

Logistic Loss ℓlog(p, y) = −log(p) if y = 1, and −log(1 − p) if y = 0. [Figure: −log(p) and −log(1 − p) plotted as functions of p.] When y = 1, we want p = 1: there is no penalty at p = 1 and an increasing penalty as p moves away from 1. Similar logic applies when y = 0.
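
A minimal sketch of this loss for a single prediction, directly transcribing the definition above (names are ours):

import numpy as np

def log_loss(p, y):
    # Logistic (log) loss for predicted p = P[y = 1 | x] and true label y in {0, 1}
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

print(log_loss(0.9, 1))   # small penalty: confident and correct
print(log_loss(0.1, 1))   # large penalty: confident but wrong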

Categorical Data and One-Hot-Encoding

Logistic Regression Optimization Regularized Logistic Regression: Learn a mapping w that minimizes the logistic loss on the training data plus a regularization term: min_w Σ_{i=1}^{n} ℓlog(σ(w⊤x^(i)), y^(i)) + λ‖w‖₂², where the first term is the training log loss and the second measures model complexity. The data is assumed to be numerical! A similar story holds for linear regression and many other methods.
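
The lab minimizes this objective with MLlib. Purely for intuition, here is a small NumPy gradient-descent sketch of the same objective, assuming a dense feature matrix X and labels y in {0, 1}; it is not the course's implementation, and the toy data and step size are ours:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lam=0.1, step=0.1, n_iters=100):
    # Gradient descent on: sum_i loglos(sigmoid(w.x_i), y_i) + lam * ||w||_2^2
    # Gradient: X^T (sigmoid(Xw) - y) + 2 * lam * w
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T.dot(sigmoid(X.dot(w)) - y) + 2.0 * lam * w
        w -= step * grad
    return w

# Toy data: two small, well-separated clusters
X = np.array([[1.0, 2.0], [1.5, 1.8], [-1.0, -2.0], [-1.2, -1.5]])
y = np.array([1, 1, 0, 0])
w = train_logreg(X, y)
print(sigmoid(X.dot(w)))  # well above 0.5 for the first two rows, well below for the rest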

Raw Data is Sometimes Numeric: Images, User Ratings

Raw Data is Often Non-Numeric: Email, Genomic Data, Web Hypertext

Raw Data is Often Non-Numeric Example: Click-through Rate Prediction. User features: Gender, Nationality, Occupation, ... Advertiser / Publisher: Industry, Location, ... Ad / Publisher Site: Language, Text, Target Audience, ...

How to Handle Non-Numeric Features? Option 1: Use methods that support these features Some methods, e.g., Decision Trees, Naive Bayes, naturally support non-numerical features However, this limits our options Option 2: Convert these features to numeric features Allows us to use a wider range of learning methods How do we do this?

Types of Non-Numeric Features Categorical Feature: has two or more categories with no intrinsic ordering, e.g., Gender, Country, Occupation, Language. Ordinal Feature: has two or more categories with an intrinsic ordering but no consistent spacing between categories, i.e., all we have is a relative ordering; often seen in survey questions, e.g., "Is your health poor, reasonable, good, or excellent?"

Non-Numeric → Numeric One idea: Create a single numerical feature to represent the non-numeric one. Ordinal Features: Health categories = { poor, reasonable, good, excellent }; map poor = 1, reasonable = 2, good = 3, excellent = 4. We can use a single numerical feature that preserves this ordering, but ordinal features only have an ordering, so the mapping introduces a degree of closeness (equal spacing) that didn't previously exist.

Non-Numeric → Numeric One idea: Create a single numerical feature to represent the non-numeric one. Categorical Features: Country categories = { ARG, FRA, USA }; map ARG = 1, FRA = 2, USA = 3. This mapping implies that FRA lies between ARG and USA. Creating a single numerical feature introduces relationships between categories that don't otherwise exist.

Non-Numeric → Numeric Another idea (One-Hot-Encoding): Create a dummy feature for each category. Categorical Features: Country categories = { ARG, FRA, USA }. We introduce one new dummy feature for each category: ARG → [1 0 0], FRA → [0 1 0], USA → [0 0 1]. Creating dummy features doesn't introduce spurious relationships.

Computing and Storing OHE Features

Example: Categorical Animal Dataset Features: Animal = { bear, cat, mouse } Color = { black, tabby } Diet (optional) = { mouse, salmon } Datapoints: A1 = [ mouse, black, - ] A2 = [ cat, tabby, mouse ] A3 = [ bear, black, salmon ] How can we create OHE features?

Step 1: Create OHE Dictionary Features: Animal = { bear, cat, mouse }, Color = { black, tabby }, Diet = { mouse, salmon }. There are 7 dummy features in total; the mouse category is distinct for the Animal and Diet features. OHE Dictionary: maps each (feature, category) pair to a dummy feature index: (Animal, bear) → 0, (Animal, cat) → 1, (Animal, mouse) → 2, (Color, black) → 3, ...

Step 2: Create Features with Dictionary Datapoints: A1 = [ mouse, black, - ], A2 = [ cat, tabby, mouse ], A3 = [ bear, black, salmon ]. OHE Features: map each non-numeric feature to its binary dummy features using the OHE dictionary ((Animal, bear) → 0, (Animal, cat) → 1, (Animal, mouse) → 2, (Color, black) → 3, ...). E.g., A1 = [0, 0, 1, 1, 0, 0, 0].
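
A small Python sketch of both steps for this animal dataset; the exact index ordering in the lab may differ, but the resulting A1 vector matches the slide:

# Step 1: map each (feature, category) pair to a dummy-feature index
features = {
    'Animal': ['bear', 'cat', 'mouse'],
    'Color':  ['black', 'tabby'],
    'Diet':   ['mouse', 'salmon'],
}
ohe_dict = {}
for feat in ['Animal', 'Color', 'Diet']:
    for cat in features[feat]:
        ohe_dict[(feat, cat)] = len(ohe_dict)
# e.g. ohe_dict[('Animal', 'bear')] == 0, ohe_dict[('Animal', 'cat')] == 1, ...

# Step 2: encode a datapoint as a binary vector of length 7
def one_hot_encode(point, ohe_dict, num_features):
    vec = [0] * num_features
    for feat, cat in point:
        vec[ohe_dict[(feat, cat)]] = 1
    return vec

a1 = [('Animal', 'mouse'), ('Color', 'black')]        # Diet is missing for A1
print(one_hot_encode(a1, ohe_dict, len(ohe_dict)))    # [0, 0, 1, 1, 0, 0, 0]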

OHE Features are Sparse For a given categorical feature, only a single OHE feature is non-zero. Can we take advantage of this fact? Dense representation: store all numbers, e.g., A1 = [0, 0, 1, 1, 0, 0, 0]. Sparse representation: store indices and values for the non-zero entries and assume all other entries are zero, e.g., A1 = [ (2,1), (3,1) ].
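
As a local illustration of the same idea (the lab uses MLlib's sparse vectors; this sketch assumes SciPy is available instead):

from scipy.sparse import csr_matrix

# Dense: store every entry, including the zeros
a1_dense = [0, 0, 1, 1, 0, 0, 0]

# Sparse: store only (index, value) pairs for the non-zero entries
indices = [2, 3]
values = [1, 1]
a1_sparse = csr_matrix((values, ([0, 0], indices)), shape=(1, 7))
print(a1_sparse)            # stores just (0, 2) -> 1 and (0, 3) -> 1
print(a1_sparse.toarray())  # recovers the dense row [[0 0 1 1 0 0 0]]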

Sparse Representation Example: a matrix with 10M observations and 1K features; assume 1% of entries are non-zero. Dense representation: store all numbers, i.e., 10M × 1K entries as doubles, or 80GB of storage. Sparse representation: store the value and location of each non-zero entry (2 doubles per entry), a 50× savings in storage! We will also see computational savings for matrix operations.
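
A back-of-the-envelope check of the numbers on this slide (8 bytes per double):

n, d = 10_000_000, 1_000
dense_gb = n * d * 8 / 1e9                # all entries as doubles: 80 GB
sparse_gb = 0.01 * n * d * 2 * 8 / 1e9    # index + value for the 1% non-zeros: 1.6 GB
print(dense_gb, sparse_gb, dense_gb / sparse_gb)   # 80.0, 1.6, 50x savings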

Feature Hashing

Non-Numeric → Numeric One-Hot-Encoding: Create a dummy feature for each category. Creating dummy features doesn't introduce spurious relationships, but it can drastically increase dimensionality, since the number of dummy features equals the number of categories! This is an issue for CTR prediction data, which includes many names (of products, advertisers, etc.) and text from the advertisement, publisher site, etc.

Bag of Words Representation [Figure: two emails, a spam message from illegitimate@bad.com ("Eliminate your debt by giving us your money...") and a legitimate one from bob@good.com ("Hi, it's been a while! How are you?..."), each mapped against the vocabulary { been, debt, eliminate, giving, how, it's, money, while } to a binary vector indicating which vocabulary words appear.] Represent each document with a vocabulary of words; there are over 1M words in English [Global Language Monitor, 2014]. We sometimes also consider bigrams, i.e., adjacent pairs of words (a similar idea to quadratic features).
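
A minimal bag-of-words sketch using the small vocabulary from the figure (tokenization is deliberately simplified here):

vocabulary = ['been', 'debt', 'eliminate', 'giving', 'how', "it's", 'money', 'while']

def bag_of_words(text, vocabulary):
    # Binary presence indicator for each vocabulary word (counts are also common)
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

spam = "eliminate your debt by giving us your money"
print(bag_of_words(spam, vocabulary))  # [0, 1, 1, 1, 0, 0, 1, 0]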

High Dimensionality of OHE Statistically: inefficient learning. We generally need a bigger n when we have a bigger d (though in the distributed setting we often have very large n), and we will have many non-predictive features. Computationally: increased communication. Linear models have parameter vectors of dimension d, and gradient descent communicates the parameter vector to all workers at each iteration.

How Can We Reduce Dimension? One option: Discard rare features. This might throw out useful information (rare ≠ uninformative), and we must first compute the OHE features, which is expensive. Another option: Feature hashing. Uses hashing principles to reduce the feature dimension and can be viewed as an unsupervised preprocessing step; it obviates the need to compute an expensive OHE dictionary, preserves sparsity, and has theoretical underpinnings.

High-Level Idea Hash tables are an efficient data structure for data lookup, and hash functions are also useful in cryptography. Hash Function: maps an object to one of m buckets; it should be efficient and distribute objects evenly across buckets. In our setting, the objects are feature categories, and we have fewer buckets than feature categories. Different categories will therefore collide, i.e., map to the same bucket. The bucket indices are the hashed features.

Feature Hashing Example Datapoints (7 feature categories): A1 = [ mouse, black, - ], A2 = [ cat, tabby, mouse ], A3 = [ bear, black, salmon ]. Hash Function (m = 4): H(Animal, mouse) = 3, H(Color, black) = 2, H(Animal, cat) = 0, H(Color, tabby) = 0, H(Diet, mouse) = 2, H(Animal, bear) = 0, H(Diet, salmon) = 1. Hashed Features: A1 = [ 0 0 1 1 ], A2 = [ 2 0 1 0 ], A3 = [ 1 1 1 0 ].
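
A small Python sketch of this bucketing; Python's built-in hash() stands in for the slide's hash function, so the exact bucket assignments will generally differ from the H values above:

def hash_features(point, m):
    # Bucket index = hash of the (feature, category) pair modulo m; collisions add up
    buckets = [0] * m
    for feat, cat in point:
        buckets[hash((feat, cat)) % m] += 1
    return buckets

a2 = [('Animal', 'cat'), ('Color', 'tabby'), ('Diet', 'mouse')]
print(hash_features(a2, m=4))  # a length-4 hashed feature vector; with the slide's
                               # hash values this would be [2, 0, 1, 0]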

Why Is This Reasonable? Hashed features have nice theoretical properties: under certain conditions they provide good approximations of the inner products of OHE features, and many learning methods (including linear and logistic regression) can be viewed solely in terms of inner products. They also show good empirical performance, e.g., on spam filtering and various other text classification tasks. Hashed features are therefore a reasonable alternative to OHE features.

Distributed Computation Step 1: Apply the hash function to the raw data. This is local computation, and hash functions are usually fast; there is no need to compute OHE features and no communication. Step 2: Store the hashed features in a sparse representation. This is also local computation, and it saves storage and speeds up computation.
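
A rough sketch of these two steps with PySpark RDDs (as used in this course); this is not the lab's code, and the md5-based bucket function is just an illustrative, deterministic stand-in for the lab's hash function:

import hashlib
from pyspark import SparkContext

def bucket(feat, cat, m):
    # Deterministic hash so every worker maps a category to the same bucket
    digest = hashlib.md5(('%s,%s' % (feat, cat)).encode('utf-8')).hexdigest()
    return int(digest, 16) % m

def hash_point_sparse(point, m):
    # Steps 1 and 2 together: hash each (feature, category) pair locally and
    # keep a sparse {bucket: value} dict instead of a length-m dense vector
    sparse = {}
    for feat, cat in point:
        b = bucket(feat, cat, m)
        sparse[b] = sparse.get(b, 0) + 1
    return sparse

sc = SparkContext(appName="hashing-sketch")
raw = sc.parallelize([
    [('Animal', 'mouse'), ('Color', 'black')],
    [('Animal', 'cat'), ('Color', 'tabby'), ('Diet', 'mouse')],
    [('Animal', 'bear'), ('Color', 'black'), ('Diet', 'salmon')],
])
hashed = raw.map(lambda point: hash_point_sparse(point, 4))  # runs locally on workers
print(hashed.collect())
sc.stop()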

CTR Prediction Pipeline / Lab Preview

[Figure: supervised learning pipeline — Obtain Raw Data → Split Data (full dataset → training, validation, and test sets) → Feature Extraction → Supervised Learning (model) → Evaluation (accuracy) → Predict (new entity → prediction).] Goal: Estimate P(click | user, ad, publisher info). Given: Massive amounts of labeled data.

Obtain Raw Data: the Criteo dataset from a Kaggle competition. We'll work with a subsample of the larger CTR dataset, with 39 masked user, ad, and publisher features. The full Kaggle dataset has 33M distinct categories (and that dataset is itself a small subset of Criteo's actual data).

Split Data: Create training, validation, and test sets.

Feature Extraction: One-hot-encoding and feature hashing. We'll use a sparse data representation and visualize feature frequency. Feature extraction is the main focus of this lab.

Supervised Learning: Logistic regression, using the MLlib implementation.

Evaluation (Part 1): Hyperparameter tuning. Grid search to find good values for the regularization parameter, evaluating with logistic loss; visualize the grid search and visualize predictions via an ROC curve.

Evaluation (Part 2): Evaluate the final model using logistic loss, and compare it to a baseline model that always predicts a value equal to the fraction of training points that correspond to click-through events.
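
For intuition, here is a small NumPy sketch of that baseline comparison; the toy labels and helper names are ours, not the lab's:

import numpy as np

def log_loss(preds, labels):
    eps = 1e-12  # clip to avoid log(0)
    preds = np.clip(preds, eps, 1 - eps)
    return -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))

# Baseline: always predict the fraction of training points that are clicks
train_labels = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])   # toy data, ~20% click rate
baseline_pred = train_labels.mean()

test_labels = np.array([0, 1, 0, 0, 0])
baseline_loss = log_loss(np.full(len(test_labels), baseline_pred), test_labels)
print(baseline_loss)   # a trained model should achieve a lower log loss than this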

Predict: The final model could be used to predict the click-through rate for a new user-ad tuple (we won't do this in the lab, though).