CREDIT RISK MODELING IN R. Finding the right cut-off: the strategy curve


Constructing a confusion matrix

> predict(log_reg_model, newdata = test_set, type = "response")
         1         2          3         4          5
0.08825517 0.3502768 0.28632298 0.1657199 0.11264550

> predict(class_tree, newdata = test_set)
          0         1
1 0.7873134 0.2126866
2 0.6250000 0.3750000
3 0.6250000 0.3750000
4 0.7873134 0.2126866
5 0.5756867 0.4243133

Cut-off?

> pred_log_regression_model <- predict(log_reg_model, newdata = test_set, type = "response")
> cutoff <- 0.14
> class_pred_logit <- ifelse(pred_log_regression_model > cutoff, 1, 0)
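The thresholding step above feeds straight into a confusion matrix via `table()`. A minimal, self-contained sketch: `pred_prob` and `actual` are synthetic stand-ins for the course's predicted PDs and true loan status, not the course objects themselves.

```r
# Sketch: from predicted probabilities to a confusion matrix with table().
# pred_prob and actual are synthetic stand-ins, not the course data.
set.seed(11)
pred_prob  <- runif(100)                       # stand-in predicted PDs
actual     <- rbinom(100, 1, pred_prob)        # synthetic true loan status
cutoff     <- 0.14
class_pred <- ifelse(pred_prob > cutoff, 1, 0)
conf_mat   <- table(actual, class_pred)        # rows = actual, cols = predicted
conf_mat
```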

A certain strategy

> log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
> predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")
> cutoff <- quantile(predictions_all_full, 0.8)
> cutoff
      80%
0.1600124
> pred_full_20 <- ifelse(predictions_all_full > cutoff, 1, 0)
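The `quantile()` trick above can be reproduced on synthetic data; a sketch assuming only a vector of predicted PDs (the names and numbers are illustrative, not the course data):

```r
# Sketch: deriving a cut-off from a target acceptance rate with quantile().
# probs is a synthetic stand-in for the model's predicted PDs.
set.seed(1)
probs       <- runif(1000)
accept_rate <- 0.80                          # accept the 80% with the lowest PD
cutoff      <- quantile(probs, accept_rate)
pred_20     <- ifelse(probs > cutoff, 1, 0)  # flag the riskiest 20% for rejection
mean(pred_20)                                # close to 1 - accept_rate
```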

A certain strategy (continued)

> true_and_predval <- cbind(test_set$loan_status, pred_full_20)
> true_and_predval
  test_set$loan_status pred_full_20
1                    0            0
2                    0            0
3                    0            1
4                    0            0
5                    0            1
> accepted_loans <- true_and_predval[pred_full_20 == 0, 1]
> bad_rate <- sum(accepted_loans)/length(accepted_loans)
> bad_rate
[1] 0.08972541

The strategy table

      accept_rate cutoff bad_rate
 [1,]        1.00 0.5142   0.1069
 [2,]        0.95 0.2122   0.0997
 [3,]        0.90 0.1890   0.0969
 [4,]        0.85 0.1714   0.0927
 [5,]        0.80 0.1600   0.0897
 [6,]        0.75 0.1471   0.0861
 [7,]        0.70 0.1362   0.0815
 [8,]        0.65 0.1268   0.0766
[16,]        0.25 0.0644   0.0425
[17,]        0.20 0.0590   0.0366
[18,]        0.15 0.0551   0.0371
[19,]        0.10 0.0512   0.0309
[20,]        0.05 0.0453   0.0247
[21,]        0.00 0.0000   0.0000
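A table of this shape can be generated by sweeping the acceptance rate. A sketch on synthetic data: `probs` and `actual` are stand-ins for the test-set predictions and true loan status, so the numbers will not match the table above.

```r
# Sketch: building a strategy table by sweeping the acceptance rate.
# probs and actual are synthetic stand-ins for the course's test-set objects.
set.seed(42)
probs  <- runif(500)
actual <- rbinom(500, 1, probs)          # defaults correlated with the PDs
accept_rates <- seq(1.00, 0.00, by = -0.05)
strategy <- t(sapply(accept_rates, function(ar) {
  cutoff   <- unname(quantile(probs, ar))
  accepted <- actual[probs <= cutoff]    # loans below the cut-off are accepted
  bad_rate <- if (length(accepted) > 0) mean(accepted) else 0
  c(accept_rate = ar, cutoff = round(cutoff, 4), bad_rate = round(bad_rate, 4))
}))
head(strategy)
```

Lowering the acceptance rate accepts only the safest loans, so the bad rate falls with the acceptance rate, which is exactly the trade-off the strategy curve visualizes.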

The strategy curve

CREDIT RISK MODELING IN R Let's practice!

CREDIT RISK MODELING IN R The ROC-curve

Until now: the strategy table/curve, which still makes an assumption about the cut-off. What is the overall best model?

Confusion matrix

                                    model prediction
                              no default (0)   default (1)
actual loan   no default (0)        TN              FP
status        default (1)           FN              TP

Accuracy? accuracy = (TP + TN) / (TP + FP + TN + FN), and it changes when the cut-off changes.
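The cut-off-dependent metrics follow directly from the four cells of the confusion matrix; a sketch with illustrative counts (not taken from the course data):

```r
# Sketch: accuracy, sensitivity and specificity from confusion-matrix cells.
# The counts are illustrative, not taken from the course data.
TN <- 80; FP <- 10; FN <- 5; TP <- 5
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)     # true positive rate
specificity <- TN / (TN + FP)     # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```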

The ROC-curve, traced from cut-off = 1 (bottom left) to cut-off = 0 (top right).

The ROC-curve

Which one is better? AUC of ROC-curve A = 0.75, AUC of ROC-curve B = 0.78: model B has the higher AUC.
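An AUC like the 0.75 and 0.78 above can be traced by sweeping the cut-off. The course uses pROC's `auc()`; the base-R version below, on synthetic scores and labels, is only an illustrative sketch of what that number measures.

```r
# Sketch: ROC points and AUC by sweeping the cut-off (base R only).
# scores and labels are synthetic, not the course data.
set.seed(7)
scores  <- runif(200)
labels  <- rbinom(200, 1, scores)
cutoffs <- seq(1, 0, by = -0.01)
tpr <- sapply(cutoffs, function(k) mean(scores[labels == 1] > k))  # sensitivity
fpr <- sapply(cutoffs, function(k) mean(scores[labels == 0] > k))  # 1 - specificity
# Area under the (fpr, tpr) polyline via the trapezoidal rule
auc_value <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc_value   # well above the 0.5 of a random model
```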

CREDIT RISK MODELING IN R Let's practice!

CREDIT RISK MODELING IN R Input selection based on the AUC

ROC curves for 4 logistic regression models: a model with 7 variables versus models with 4 variables.

AUC-based pruning

1) Start with a model including all variables (in our case, 7) and compute the AUC:

> log_model_full <- glm(loan_status ~ loan_amnt + grade + home_ownership + annual_inc +
                        age + emp_cat + ir_cat, family = "binomial", data = training_set)
> predictions_model_full <- predict(log_model_full, newdata = test_set, type = "response")
> AUC_model_full <- auc(test_set$loan_status, predictions_model_full)
Area under the curve: 0.6512

AUC-based pruning

2) Build 7 new models, where each time one of the variables is removed, and make PD predictions using the test set:

log_1_remove_amnt <- glm(loan_status ~ grade + home_ownership + annual_inc +
                         age + emp_cat + ir_cat, family = "binomial", data = training_set)
log_1_remove_grade <- glm(loan_status ~ loan_amnt + home_ownership + annual_inc +
                          age + emp_cat + ir_cat, family = "binomial", data = training_set)
log_1_remove_home <- glm(loan_status ~ loan_amnt + grade + annual_inc +
                         age + emp_cat + ir_cat, family = "binomial", data = training_set)

pred_1_remove_amnt <- predict(log_1_remove_amnt, newdata = test_set, type = "response")
pred_1_remove_grade <- predict(log_1_remove_grade, newdata = test_set, type = "response")
pred_1_remove_home <- predict(log_1_remove_home, newdata = test_set, type = "response")

AUC-based pruning

3) Keep the model that led to the best AUC (AUC full model: 0.6512):

> auc(test_set$loan_status, pred_1_remove_amnt)
Area under the curve: 0.6514
> auc(test_set$loan_status, pred_1_remove_grade)
Area under the curve: 0.6438
> auc(test_set$loan_status, pred_1_remove_home)
Area under the curve: 0.6537

Remove variable home_ownership.

4) Repeat until the AUC decreases (significantly).
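The remove-one-at-a-time step can be written as a loop. A self-contained sketch on synthetic data: the variables `x1`..`x3`, the train/test split, and the rank-based `auc_rank` helper are all illustrative stand-ins, and the course computes the AUC with pROC's `auc()` instead.

```r
# Sketch of one AUC-based pruning step on synthetic data.
# x1..x3, auc_rank and the train/test split are illustrative stand-ins;
# the course uses pROC::auc() to compute the AUC.
set.seed(3)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y   <- rbinom(n, 1, plogis(1.5 * d$x1 + 0.8 * d$x2))  # x3 is pure noise
train <- d[1:300, ]
test  <- d[301:400, ]

auc_rank <- function(y, p) {            # Mann-Whitney identity for the AUC
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

vars     <- c("x1", "x2", "x3")
full     <- glm(y ~ ., family = "binomial", data = train)
auc_full <- auc_rank(test$y, predict(full, newdata = test, type = "response"))

# Refit once per left-out variable and score each reduced model on the test set
auc_drop <- sapply(vars, function(v) {
  f <- reformulate(setdiff(vars, v), response = "y")
  m <- glm(f, family = "binomial", data = train)
  auc_rank(test$y, predict(m, newdata = test, type = "response"))
})
names(which.max(auc_drop))   # candidate variable to remove next
```

Dropping the noise variable should barely move the AUC, while dropping a real predictor should hurt it, which is what makes the best-AUC-after-removal rule a sensible pruning criterion.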

CREDIT RISK MODELING IN R Let's practice!

CREDIT RISK MODELING IN R Course wrap-up

Other methods
Discriminant analysis
Random forest
Neural networks
Support vector machines

But these methods are very classification-focused: the timing aspect is neglected. A newly popular method is survival analysis, where PDs that change over time and time-varying covariates can be included.

CREDIT RISK MODELING IN R The end!