Artificial Neural Networks (Feedforward Nets)

Similar documents
6.034 Notes: Section 8.1

Dimensionality Reduction, including by Feature Selection.

The exam is closed book, closed notes except your one-page cheat sheet.

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Predicting Diabetes and Heart Disease Using Diagnostic Measurements and Supervised Learning Classification Models

5 Learning hypothesis classes (16 points)

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Network Traffic Measurements and Analysis

6.034 Quiz 2, Spring 2005

Variable Selection 6.783, Biomedical Decision Support

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Performance Evaluation of Various Classification Algorithms

Bayes Risk. Classifiers for Recognition Reading: Chapter 22 (skip 22.3) Discriminative vs Generative Models. Loss functions in classifiers

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

Machine Learning in Biology

Classifiers for Recognition Reading: Chapter 22 (skip 22.3)

CS6220: DATA MINING TECHNIQUES

Classification. Slide sources:

Machine Learning Classifiers and Boosting

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging

Tutorials Case studies

Perceptron as a graph

Simple Model Selection Cross Validation Regularization Neural Networks

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

Network Traffic Measurements and Analysis

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

MUSI-6201 Computational Music Analysis

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Why MultiLayer Perceptron/Neural Network? Objective: Attributes:

CS145: INTRODUCTION TO DATA MINING

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018

INF 4300 Classification III Anne Solberg The agenda today:

Last week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints

3 Nonlinear Regression

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Supervised vs unsupervised clustering

5 Machine Learning Abstractions and Numerical Optimization

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Lecture #11: The Perceptron

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Evaluating Classifiers

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

Neural Network Learning. Today s Lecture. Continuation of Neural Networks. Artificial Neural Networks. Lecture 24: Learning 3. Victor R.

CMPT 882 Week 3 Summary

CS273 Midterm Exam Introduction to Machine Learning: Winter 2015 Tuesday February 10th, 2014

CSE 158. Web Mining and Recommender Systems. Midterm recap

Features: representation, normalization, selection. Chapter e-9

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models

Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Weka ( )

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Evaluating Classifiers

Bioinformatics - Lecture 07

Image Processing. Image Features

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Evaluating Classifiers

10-701/15-781, Fall 2006, Final

A Systematic Overview of Data Mining Algorithms

Introduction to Machine Learning

Neural Network Optimization and Tuning / Spring 2018 / Recitation 3

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Machine Learning / Jan 27, 2010

Regularization and model selection

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

CS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016

Fall 09, Homework 5

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

DI TRANSFORM. The regressive analyses. identify relationships

CS249: ADVANCED DATA MINING

Random Forest A. Fornaser

Perceptrons and Backpropagation. Fabio Zachert Cognitive Modelling WiSe 2014/15

2. On classification and related tasks

Please write your initials at the top right of each page (e.g., write JS if you are Jonathan Shewchuk). Finish this by the end of your 3 hours.

Unsupervised Learning

Machine Learning with MATLAB --classification

CS325 Artificial Intelligence Ch. 20 Unsupervised Machine Learning

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Based on Raymond J. Mooney s slides

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

IBL and clustering. Relationship of IBL with CBR

3 Nonlinear Regression

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Deep Learning for Computer Vision

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen

Machine Learning 13. week

Classification: Linear Discriminant Functions

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

4. Feedforward neural networks. 4.1 Feedforward neural network structure

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA

Transcription:

Artificial Neural Networks (Feedforward Nets) y w 03-1 w 13 y 1 w 23 y 2 w 01 w 21 w 22 w 02-1 w 11 w 12-1 x 1 x 2 6.034 - Spring 1

Single Perceptron Unit y w 0 w 1 w n w 2 w 3 x 0 =1 x 1 x 2 x 3... x n 6.034 - Spring 2

Linear Classifier Single Perceptron Unit h( x ) = #( w! x + b) " #( w! x) %( z) # 1 z $ 0 = "! 0 else x 2 w y x w 0 w 1 w n x 1 w 2 w 3 x 0 =1 x 1 x 2 x 3... x n 6.034 - Spring 3

Beyond Linear Separability 1 0 Not linearly separable 0 1 6.034 - Spring 4

Beyond Linear Separability 1 0 0 1 1 0 Not linearly separable 0 1 6.034 - Spring 5

Beyond Linear Separability 1 0 0 1 1 0 Not linearly separable 0 1 1 0 0 1 6.034 - Spring 6

Multi-Layer Perceptron x x 1 x 2 y x 3 y 1 y 2 x u input hidden hidden output More powerful than single layer. Lower layers transform the input problem into more tractable (linearly separable) problems for subsequent layers. 6.034 - Spring 7

XOR Problem o 1 Not linearly separable 1 0 0 1 o 2 o 1 w 11 x w 1 13 w 12 w 21 x 2 w 22 w 01-1 w 02-1 w 23 o 2 w 03-1 y w 01 = 3/2 w 11 = w 12 = 1 x 1 x 2 o 1 w 02 = 1/2 w 21 = w 22 = 1 0 0 1 1 0 1 0 1 0 0 0 1 6.034 - Spring 8

XOR Problem o 1 Not linearly separable 1 0 0 1 o 2 o 1 w 11 x w 1 13 w 12 w 21 x 2 w 22 w 01-1 w 02-1 w 23 o 2 w 03-1 y w 01 = 3/2 w 11 = w 12 = 1 x 1 x 2 o 1 o 2 w 02 = 1/2 w 21 = w 22 = 1 0 0 1 1 0 1 0 1 0 0 0 1 0 1 1 1 6.034 - Spring 9

XOR Problem o 1 Not linearly separable 1 0 0 1 o 2 o 1 w 11 x w 1 13 w 12 w 21 x 2 w 22 w 01-1 w 02-1 w 23 o 2 w 03-1 y w 01 = 3/2 w 11 = w 12 = 1 w 02 = 1/2 w 21 = w 22 = 1 w 03 = 1/2 w 31 = -1, w 32 = 1 x 1 0 0 1 1 x 2 0 1 0 1 o 1 0 0 0 1 o 2 0 1 1 1 y 0 1 1 0 o 2 1 0 0 Linearly separable o 1 6.034 - Spring 10

Multi-Layer Perceptron Learning Any set of training points can be separated by a three-layer perceptron network. Almost any set of points separable by two-layer perceptron network. But, no efficient learning rule is known. 6.034 - Spring 11

Multi-Layer Perceptron Learning Any set of training points can be separated by a three-layer perceptron network. Almost any set of points separable by two-layer perceptron network. But, no efficient learning rule is known. May need an exponential number of units. Two hidden layers and one output layer One hidden layer and one output layer 6.034 - Spring 12

Multi-Layer Perceptron Learning Any set of training points can be separated by a three-layer perceptron network. Almost any set of points separable by two-layer perceptron network. But, no efficient learning rule is known. Could we use gradient ascent/descent? We would need smoothness: small change in weights produces small change in output. Threshold function is not smooth. 6.034 - Spring 13

Multi-Layer Perceptron Learning Any set of training points can be separated by a three-layer perceptron network. Almost any set of points separable by two-layer perceptron network. But, no efficient learning rule is known. Could we use gradient ascent/descent? We would need smoothness: small change in weights produces small change in output. Threshold function is not smooth. Use a smooth threshold function! 6.034 - Spring 14

Sigmoid Unit y s(z) w w 0 w w n 1 2... z = 1 n! wixi s( z) = " z i 1 + e -1 x 1 x 2 x n 6.034 - Spring 15

Sigmoid Unit y y w 0 w 1 w 2 x 1-1 x 2 y=1/2 x 1 x 2 6.034 - Spring 16

Training y z 3 y( x, w) w is a vector of weights x is a vector of inputs -1 z 1 w 01 w 03 w 13 y 1 w 23 y 2 w 21 w 22 z 2 w 02-1 w 11 w 12-1 x 1 x 2 y = z 1 z 2 s( w13s( w11x1 + w21x2! w01) + w23s( w12x1 + w22x2! w02)! w03) z 3 6.034 - Spring 17

Training y z 3 y( x, w) w is a vector of weights x is a vector of inputs y i is desired output: Error over the training set for a given weight vector: E = 1 2 i!( y( x, w) " i Our goal is to find weight vector that minimizes error y z 1 z 2 i w 01 2 ) -1 z 1 w 03 w 13 y 1 w 23 y 2 w 21 w 22 z 2 w 02-1 w 11 w 12-1 x 1 x 2 y = s( w13s( w11x1 + w21x2! w01) + w23s( w12x1 + w22x2! w02)! w03) z 3 6.034 - Spring 18

E " ( w w = E 1 2 = Gradient Descent i!( y( x, w) " i w # w " $! we y i 2 ) i i i!( y( x, w) # y )" wy( x, w) i & ' y ' y y = $,, %' w1 ' w n #! " Error on training set Gradient of Error Gradient Descent 6.034 - Spring 19

Training Neural Nets without overfitting, hopefully Given: Data set, desired outputs and a neural net with m weights. Find a setting for the weights that will give good predictive performance on new data. Estimate expected performance on new data. 1. Split data set (randomly) into three subsets: Training set used for picking weights Validation set used to stop training Test set used to evaluate performance 2. Pick random, small weights as initial values 3. Perform iterative minimization of error over training set. 4. Stop when error on validation set reaches a minimum (to avoid overfitting). 5. Repeat training (from step 2) several times (avoid local minima) 6. Use best weights to compute error on test set, which is estimate of performance on new data. Do not repeat training to improve this. Can use cross-validation if data set is too small to divide into three subsets. 6.034 - Spring 50

Training vs. Test Error 6.034 - Spring 51

Training vs. Test Error 6.034 - Spring 52

Overfitting is not unique to neural nets 1-Nearest Neighbors Decision Trees 6.034 - Spring 53

Overfitting in SVM Radial Kernel!=0.1 Radial Kernel!=1 6.034 - Spring 54

On-line vs off-line There are two approaches to performing the error minimization: On-line training present x i and y i* (chosen randomly from the training set). Change the weights to reduce the error on this instance. Repeat. Off-line training change weights to reduce the total error on training set (sum over all instances). On-line training is an approximation to gradient descent since the gradient based on one instance is noisy relative to the full gradient (based on all instances). This can be beneficial in pushing the system out of shallow local minima. 6.034 - Spring 55

Feature Selection In many machine learning applications, there are huge numbers of features text classification (# words) gene arrays (5,000 50,000) images (512 x 512 pixels) 6.034 - Spring 03 1

Feature Selection In many machine learning applications, there are huge numbers of features text classification (# words) gene arrays (5,000 50,000) images (512 x 512 pixels) Too many features make algorithms run slowly risk overfitting 6.034 - Spring 03 2

Feature Selection In many machine learning applications, there are huge numbers of features text classification (# words) gene arrays (5,000 50,000) images (512 x 512 pixels) Too many features make algorithms run slowly risk overfitting Find a smaller feature space subset of existing features new features constructed from old ones 6.034 - Spring 03 3

6.034 - Spring 03 4 Feature Ranking For each feature, compute a measure of its relevance to the output Choose the k features with the highest rankings Correlation between feature j and output Correlation measures how much x tends to deviate from its mean on the same examples on which y deviates from its mean!!! " " " " = i i i j i j i i j i j y y x x y y x x j R 2 2 ) ( ) ( ) )( ( ) (! = i i j x j n x 1! = i i y n y 1

Correlations in Heart Data Top 15 of 25 thal=1-0.52 cp=4 0.51 thal=3 0.48 ca=0-0.48 oldpeak 0.42 thalach -0.42 exang 0.42 slope=1-0.38 slope=2 0.35 cp=3-0.31 sex 0.28 ca=2 0.27 cp=2-0.25 ca=1 0.23 age 0.23 1 ca = 0 no 0 1 thal = 1 yes ca=0 exang chest-pain age < 57.5 0 1 oldpk<3.2 0 1 0 6.034 - Spring 03 5

Correlations in MPG > 22 data cyl=4 0.82 displacement -0.77 weight -0.77 horsepower -0.67 cyl=8-0.58 origin=1-0.54 model-year 0.44 origin=3 0.40 cyl=6-0.37 acceleration 0.35 origin=2 0.26 displacement > 189.5 weight > 2224.5 1 year > 78.5 weight > 2775 1 1 0 0 6.034 - Spring 03 6

XOR Bites Back As usual, functions with XOR in them will cause us trouble Each feature will, individually, have a correlation of 0 (it occurs positively as much as negatively for positive outputs) To solve XOR, we need to look at groups of features together 6.034 - Spring 03 7

Subset Selection Consider subsets of variables too hard to consider all possible subsets wrapper methods: use training set or crossvalidation error to measure the goodness of using different feature subsets with your classifier greedily construct a good subset by adding or subtracting features one by one 6.034 - Spring 03 8

Forward Selection Given a particular classifier you want to use F = {} For each f j Train classifier with inputs F + {f j } Add f j that results in lowest-error classifier to F Continue until F is the right size, or error has quit decreasing 6.034 - Spring 03 9

Forward Selection Given a particular classifier you want to use F = {} For each f j Train classifier with inputs F + {f j } Add f j that results in lowest-error classifier to F Continue until F is the right size, or error has quit decreasing Decision trees, by themselves, do something similar to this 6.034 - Spring 03 10

Forward Selection Given a particular classifier you want to use F = {} For each f j Train classifier with inputs F + {f j } Add f j that results in lowest-error classifier to F Continue until F is the right size, or error has quit decreasing Decision trees, by themselves, do something similar to this Trouble with XOR 6.034 - Spring 03 11

Backward Elimination Given a particular classifier you want to use F = all features For each f j Train classifier with inputs F - {f j } Remove f j that results in lowest-error classifier from F Continue until F is the right size, or error increases too much 6.034 - Spring 03 12

Forward Selection on Auto Data crossvalidation accuracy 0.95 0.94 Forward Selection - Auto 0.93 10-way X-val Accuracy 0.92 0.91 Series1 0.9 0.89 1 2 3 4 5 6 7 8 9 10 11 Number Features number of features added 6.034 - Spring 03 13

Backward Elimination on Auto Data crossvalidation accuracy 0.96 0.95 Backward Selection - Auto 0.94 0.93 10-way X-val Accuracy 0.92 0.91 0.9 0.89 Series1 0.88 0.87 0.86 0.85 1 2 3 4 5 6 7 8 9 10 11 Number of Features Eliminated number of features eliminated 6.034 - Spring 03 14

Forward Selection on Heart Data crossvalidation accuracy 0.86 0.84 Forward Selection - Hear 0.82 10-way X-val Accuracy 0.8 0.78 Series1 0.76 0.74 0.72 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Number of features number of features added 6.034 - Spring 03 15

Backward Elimination on Heart Data Backward Elimination - Heart crossvalidation accuracy 0.86 0.84 0.82 0.8 10-way X-val Accuray 0.78 0.76 0.74 0.72 0.7 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Number of features eliminated number of features eliminated 6.034 - Spring 03 16

Recursive Feature Elimination Train a linear SVM or neural network Remove the feature with the smallest weight Repeat More efficient than regular backward elimination Requires only one training phase per feature 6.034 - Spring 03 17

Clustering Form clusters of inputs Map the clusters into outputs Given a new example, find its cluster, and generate the associated output 6.034 - Spring 03 18

Clustering Form clusters of inputs Map the clusters into outputs Given a new example, find its cluster, and generate the associated output 6.034 - Spring 03 19

Clustering Form clusters of inputs Map the clusters into outputs Given a new example, find its cluster, and generate the associated output 6.034 - Spring 03 20

Clustering Criteria small distances between points within a cluster large distances between clusters Need a distance measure, as in nearest neighbor 6.034 - Spring 03 21

Tries to minimize K-Means Clustering k $ $ j=1 i#s j x i " µ j 2 squared dist from point to mean # of clusters elements of cluster j mean of elts in cluster j Only gets, greedily, to a local optimum 6.034 - Spring 03 22

Choose k K-means Algorithm Randomly choose k points Cj to be cluster centers 6.034 - Spring 03 23

Choose k K-means Algorithm Randomly choose k points Cj to be cluster centers Loop Partition the data into k classes Sj according to which of the Cj they re closest to For each Sj, compute the mean of its elements and let that be the new cluster center 6.034 - Spring 03 24

Choose k K-means Algorithm Randomly choose k points Cj to be cluster centers Loop Partition the data into k classes Sj according to which of the Cj they re closest to For each Sj, compute the mean of its elements and let that be the new cluster center Stop when centers quit moving Guaranteed to terminate 6.034 - Spring 03 25

Choose k K-means Algorithm Randomly choose k points Cj to be cluster centers Loop Partition the data into k classes Sj according to which of the Cj they re closest to For each Sj, compute the mean of its elements and let that be the new cluster center Stop when centers quit moving Guaranteed to terminate If a cluster becomes empty, re-initialize the center 6.034 - Spring 03 26

K-Means Example 6.034 - Spring 03 27

K-Means Example 6.034 - Spring 03 28

K-Means Example 6.034 - Spring 03 29

K-Means Example 6.034 - Spring 03 30

K-Means Example 6.034 - Spring 03 31

K-Means Example 6.034 - Spring 03 32

K-Means Example 6.034 - Spring 03 33

K-Means Example 6.034 - Spring 03 34

K-Means Example 6.034 - Spring 03 35

K-Means Example 6.034 - Spring 03 36

Principal Components Analysis Given an n-dimensional real-valued space, data are often nearly restricted to a lower-dimensional subspace PCA helps us find such a subspace whose coordinates are linear functions of the originals 6.034 - Spring 03 37

Principal Components Analysis Given an n-dimensional real-valued space, data are often nearly restricted to a lower-dimensional subspace PCA helps us find such a subspace whose coordinates are linear functions of the originals 6.034 - Spring 03 38

Principal Components Analysis Given an n-dimensional real-valued space, data are often nearly restricted to a lower-dimensional subspace PCA helps us find such a subspace whose coordinates are linear functions of the originals http://www.okstate.edu/artsci/ botany/ordinate/pca.htm 6.034 - Spring 03 39

Cartoon of algorithm Normalize the data (subtract mean, divide by stdev) 6.034 - Spring 03 40

Cartoon of algorithm Normalize the data (subtract mean, divide by stdev) Find the line along which the data has the most variability: that s the first principal component 6.034 - Spring 03 41

Cartoon of algorithm Normalize the data (subtract mean, divide by stdev) Find the line along which the data has the most variability: that s the first principal component Project the data into the n-1 dimensional space orthogonal to the line Repeat 6.034 - Spring 03 42

Cartoon of algorithm Normalize the data (subtract mean, divide by stdev) Find the line along which the data has the most variability: that s the first principal component Project the data into the n-1 dimensional space orthogonal to the line Repeat Result is a new orthogonal set of axes First k give a lower-d space that represents the variability of the data as well as possible 6.034 - Spring 03 43

Cartoon of algorithm Normalize the data (subtract mean, divide by stdev) Find the line along which the data has the most variability: that s the first principal component Project the data into the n-1 dimensional space orthogonal to the line Repeat Result is a new orthogonal set of axes First k give a lower-d space that represents the variability of the data as well as possible Really: find the eigenvectors of the covariance matrix with the k largest eigenvalues 6.034 - Spring 03 44

Linear Transformations Only There are fancier methods that can find this structure 6.034 - Spring 03 45

Insensitive to Classification Task 6.034 - Spring 03 46

Insensitive to Classification Task There are fancier methods that can take class into account 6.034 - Spring 03 47

Validating a Classifier predicted y 0 1 true y 0 A B 1 C D 6.034 - Spring 03 48

Validating a Classifier true y predicted y 0 1 0 A B 1 C D false positive type 1 error 6.034 - Spring 03 49

Validating a Classifier false negative type 2 error true y predicted y 0 1 0 A B 1 C D false positive type 1 error 6.034 - Spring 03 50

false negative type 2 error Validating a Classifier true y predicted y 0 1 0 A B 1 C D false positive type 1 error sensitivity: P(predict 1 actual 1) = D/(C+D) true positive rate (TP) 6.034 - Spring 03 51

false negative type 2 error Validating a Classifier true y predicted y 0 1 0 A B 1 C D false positive type 1 error sensitivity: P(predict 1 actual 1) = D/(C+D) true positive rate (TP) specificity: P(predict 0 actual 0) = A/(A+B) 6.034 - Spring 03 52

false negative type 2 error Validating a Classifier true y predicted y 0 1 0 A B 1 C D false positive type 1 error sensitivity: P(predict 1 actual 1) = D/(C+D) true positive rate (TP) specificity: P(predict 0 actual 0) = A/(A+B) false-alarm rate: P(predict 1 actual 0) = B/(A+B) false positive rate (FP) 6.034 - Spring 03 53

Cost Sensitivity Predict whether a patient has pseuditis based on blood tests Disease is often fatal if left untreated Treatment is cheap and side-effect free 6.034 - Spring 03 54

Cost Sensitivity Predict whether a patient has pseuditis based on blood tests Disease is often fatal if left untreated Treatment is cheap and side-effect free Which classifier to use? Classifier 1: TP = 0.9, FP = 0.4 6.034 - Spring 03 55

Cost Sensitivity Predict whether a patient has pseuditis based on blood tests Disease is often fatal if left untreated Treatment is cheap and side-effect free Which classifier to use? Classifier 1: TP = 0.9, FP = 0.4 Classifier 2: TP = 0.7, FP = 0.1 6.034 - Spring 03 56

Build Costs into Classifier Assess costs of both types of error use a different splitting criterion for decision trees make error function for neural nets asymmetric; different costs for each kind of error use different values of C for SVMs depending on kind of error 6.034 - Spring 03 57

Tunable Classifiers Classifiers that have a threshold (naïve Bayes, neural nets, SVMs) can be adjusted, post learning, by changing the threshold, to make different tradeoffs between type 1 and type 2 errors 6.034 - Spring 03 58

Tunable Classifiers Classifiers that have a threshold (naïve Bayes, neural nets, SVMs) can be adjusted, post learning, by changing the threshold, to make different tradeoffs between type 1 and type 2 errors C 1,C 2 : costs of errors P: percentage of positive examples x: tunable threshold TP(x): true positive rate at threshold x FP(x): false positive rate at threshold x Expected Cost = C 2 P(1-TP(x)) + C 1 (1-P)FP(x) choose x to minimize expected cost 6.034 - Spring 03 59

ROC Curves receiver operating characteristics ideal 1 TP 0 FP 1 6.034 - Spring 03 60

ROC Curves receiver operating characteristics ideal 1 always output 1 TP 0 always output 0 FP 1 6.034 - Spring 03 61

ROC Curves receiver operating characteristics ideal 1 always output 1 TP parametric function of x 0 always output 0 FP 1 6.034 - Spring 03 62

ROC Curves receiver operating characteristics ideal 1 always output 1 TP blue curve dominates red 0 always output 0 FP 1 6.034 - Spring 03 63

Many more issues! Missing data Many examples in one class, few in other (fraud detection) Expensive data (active learning) 6.034 - Spring 03 64