MACHINE LEARNING TOOLBOX Logistic regression on Sonar
Classification models
Categorical (i.e. qualitative) target variable
Example: will a loan default?
Still a form of supervised learning
Use a train/test split to evaluate performance
Use the Sonar dataset
Goal: distinguish rocks from mines
Example: Sonar data
> # Load the Sonar dataset
> library(mlbench)
> data(Sonar)
> # Look at the data
> Sonar[1:6, c(1:5, 61)]
      V1     V2     V3     V4     V5 Class
1 0.0200 0.0371 0.0428 0.0207 0.0954     R
2 0.0453 0.0523 0.0843 0.0689 0.1183     R
3 0.0262 0.0582 0.1099 0.1083 0.0974     R
4 0.0100 0.0171 0.0623 0.0205 0.0205     R
5 0.0762 0.0666 0.0481 0.0394 0.0590     R
6 0.0286 0.0453 0.0277 0.0174 0.0384     R
Splitting the data
Randomly split data into training and test sets
Use a 60/40 split instead of 80/20
The Sonar dataset is small, so 60/40 gives a larger, more reliable test set
Splitting the data
> # Randomly order the dataset
> rows <- sample(nrow(Sonar))
> Sonar <- Sonar[rows, ]
> # Find row to split on
> split <- round(nrow(Sonar) * 0.60)
> train <- Sonar[1:split, ]
> test <- Sonar[(split + 1):nrow(Sonar), ]
> # Confirm the training set proportion
> nrow(train) / nrow(Sonar)
[1] 0.6009615
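Because sample() reorders the rows at random, every run produces a different split and slightly different results downstream. A minimal sketch of a reproducible version, assuming an arbitrary seed of 42 (the seed is not part of the original code):
> # Fix the random number generator so the split is repeatable
> set.seed(42)  # 42 is an arbitrary, illustrative choice
> rows <- sample(nrow(Sonar))
> Sonar <- Sonar[rows, ]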
MACHINE LEARNING TOOLBOX Let's practice!
MACHINE LEARNING TOOLBOX Confusion matrix
Confusion matrix
                  Reference
                  Yes               No
Prediction  Yes   True positive     False positive
            No    False negative    True negative
Confusion matrix
> # Fit a model
> model <- glm(Class ~ ., family = binomial(link = "logit"), train)
> p <- predict(model, test, type = "response")
> summary(p)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  0.9885  0.5296  1.0000  1.0000
> # Turn probabilities into classes and look at their frequencies
> p_class <- ifelse(p > 0.50, "M", "R")
> table(p_class)
p_class
 M  R
44 39
Confusion matrix
Make a 2-way frequency table
Compare predicted vs. actual classes
> # Make simple 2-way frequency table
> table(p_class, test[["Class"]])
p_class  M  R
      M 13 31
      R 30  9
Confusion matrix
> # Use caret's helper function to calculate additional statistics
> confusionMatrix(p_class, test[["Class"]])
          Reference
Prediction  M  R
         M 13 31
         R 30  9

               Accuracy : 0.2651
                 95% CI : (0.1742, 0.3734)
    No Information Rate : 0.5181
    P-Value [Acc > NIR] : 1

                  Kappa : -0.4731
 Mcnemar's Test P-Value : 1

            Sensitivity : 0.3023
            Specificity : 0.2250
         Pos Pred Value : 0.2955
         Neg Pred Value : 0.2308
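Note that recent versions of caret require factor inputs, so you may need confusionMatrix(factor(p_class), factor(test[["Class"]])). Every statistic above comes straight from the 2-way table; a minimal sketch recomputing a few by hand, with "M" as the positive class as caret chose above:
> # Counts from the table: rows = predicted, columns = actual
> tp <- 13; fp <- 31; fn <- 30; tn <- 9
> (tp + tn) / (tp + fp + fn + tn)  # Accuracy: 0.2651
> tp / (tp + fn)                   # Sensitivity: 0.3023
> tn / (tn + fp)                   # Specificity: 0.2250
The accuracy here sits below the no-information rate of 0.5181. One likely reason: with a factor response, glm() models the probability of the second factor level ("R"), so labeling p > 0.50 as "M" inverts the classes.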
MACHINE LEARNING TOOLBOX Let's practice!
MACHINE LEARNING TOOLBOX Class probabilities and class predictions
Different thresholds
Not limited to 50% threshold
10% would catch more mines with less certainty (see the sketch after this list)
90% would catch fewer mines with more certainty
Balance true positive and false positive rates
Cost-benefit analysis
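A minimal sketch of the low-certainty end, reusing the test-set probabilities p from the earlier model (the resulting counts depend on the random split, so no output is shown):
> # Use a smaller cutoff: almost any hint of a mine gets labeled "M"
> p_class <- ifelse(p > 0.10, "M", "R")
> table(p_class, test[["Class"]])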
Confusion matrix
> # Use a larger cutoff
> p_class <- ifelse(p > 0.99, "M", "R")
> table(p_class)
p_class
 M  R
41 42
> # Make simple 2-way frequency table
> table(p_class, test[["Class"]])
p_class  M  R
      M 13 28
      R 30 12
Confusion matrix with caret
> # Use caret to produce confusion matrix
> confusionMatrix(p_class, test[["Class"]])
          Reference
Prediction  M  R
         M 13 28
         R 30 12

               Accuracy : 0.3012
                 95% CI : (0.2053, 0.4118)
    No Information Rate : 0.5181
    P-Value [Acc > NIR] : 1.0000

                  Kappa : -0.397
 Mcnemar's Test P-Value : 0.8955

            Sensitivity : 0.3023
            Specificity : 0.3000
         Pos Pred Value : 0.3171
         Neg Pred Value : 0.2857
MACHINE LEARNING TOOLBOX Let's practice!
MACHINE LEARNING TOOLBOX Introducing the ROC curve
The challenge
Many possible classification thresholds
Requires manual work to choose
Easy to overlook a particular threshold
Need a more systematic approach
ROC curves
Plot true/false positive rate at every possible threshold (a by-hand sketch follows this list)
Visualize tradeoffs between two extremes:
100% true positive rate vs. 0% false positive rate
Result is an ROC curve
Developed as a method for analyzing radar signals
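To make "every possible threshold" concrete, a minimal sketch that sweeps a grid of cutoffs and computes both rates by hand. The helper roc_point and the 0.01 grid spacing are illustrative, not from the slides, and "M" is assumed to be the positive class:
> roc_point <- function(cutoff, p, actual) {
+   pred <- ifelse(p > cutoff, "M", "R")
+   tpr  <- sum(pred == "M" & actual == "M") / sum(actual == "M")
+   fpr  <- sum(pred == "M" & actual == "R") / sum(actual == "R")
+   c(tpr = tpr, fpr = fpr)
+ }
> # Each column of the result is one point on the ROC curve
> sapply(seq(0, 1, by = 0.01), roc_point, p = p, actual = test[["Class"]])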
An example ROC curve
> # Create ROC curve
> library(caTools)
> colAUC(p, test[["Class"]], plotROC = TRUE)
X-axis: false positive rate
Y-axis: true positive rate
Each point along the curve represents a different threshold
MACHINE LEARNING TOOLBOX Let's practice!
MACHINE LEARNING TOOLBOX Area under the curve (AUC)
From ROC to AUC
Defining AUC
Single-number summary of model accuracy
Summarizes performance across all thresholds
Rank different models within the same dataset
Defining AUC
Ranges from 0 to 1 (a sketch computing it follows this list):
0.5 = random guessing
1 = model always right
0 = model always wrong
Rule of thumb: AUC as a letter grade
0.9 = "A"
0.8 = "B"
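A minimal sketch of getting the number itself: called without plotROC = TRUE, caTools' colAUC() simply returns the AUC, which you can then compare across models scored on the same test set.
> # AUC of the logistic regression probabilities on the test set
> library(caTools)
> colAUC(p, test[["Class"]])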
MACHINE LEARNING TOOLBOX Let's practice!