
Interpretable Machine Learning with Applications to Banking

Linwei Hu
Advanced Technologies for Modeling, Corporate Model Risk, Wells Fargo
October 26, 2018

© 2018 Wells Fargo Bank, N.A. All rights reserved. For public use.

Agenda
- Machine Learning in Banking
- Machine Learning Model Interpretation
  - Locally Interpretable Model (focus of this talk)
  - Global Diagnostics
  - Explainable Neural Networks
- Example

AI/ML Applications in Banking

Traditionally, statistical and econometric techniques have been used for:
- Model development: core models and challenge models
- Model validation: benchmarking and comparisons
- Examples: credit decisioning, PD and revenue modeling, fraud detection, fair lending, etc.

Emerging challenges:
- Large data sets (in both n and p), where traditional approaches fall short
- New applications: text analytics, Natural Language Processing, etc.
- Examples: chat bots, complaint analysis, customer assistance, etc.

Artificial Intelligence (AI) / Machine Learning (ML) is being rapidly adopted across financial institutions to:
- Address these new challenges
- Improve business decisions, customer experiences, and risk management

Examples: Credit Risk Models (Stress Testing)
- Predict expected losses: PD, LGD, EAD (expected loss = PD * LGD * EAD)
- Predict under multiple time horizons and various macro-economic scenarios

Statistical models:
- Survival analysis
- Regression: LGD, etc.

Semi-parametric models:
- Varying coefficient models

Machine learning:
- Random Forests
- Gradient Boosting Machines
- Neural Nets: LSTM
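For concreteness, the expected-loss decomposition referenced above (PD * LGD * EAD) is the standard identity, forecast per time horizon t under a given macro-economic scenario:

```latex
\mathrm{EL}_t \;=\; \mathrm{PD}_t \times \mathrm{LGD}_t \times \mathrm{EAD}_t
```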

Examples: Financial Crime Models
- Anti-Money Laundering: to prevent and detect potential money laundering activities
- Fraud Detection: to prevent and detect fraudulent activities

Statistical models:
- Rule-based systems
- Clustering

Machine learning:
- One-class SVM
- Supervised and semi-supervised machine learning
- PU learning

Challenges: predictive performance + automation come at a cost.
- Models are complex and hard to interpret
  - No analytical expressions
  - Potential problems with multi-collinearity
  - Ambiguities in attribution (e.g., in credit scoring)
- May not conform to subject-matter knowledge
  - Inclusion of key variables, monotonicity constraints
- Tuning of hyper-parameters is complex and computationally intensive
- Tendency to put too much faith in automated algorithms; in fact, they now deserve more scrutiny

Machine Learning Model Interpretation

Machine learning interpretation is an active research area. Our team (Advanced Technologies for Modeling) has several research projects in this area:
- Locally Interpretable Model (focus)
- Global Diagnostics
- Explainable Neural Networks

[Diagram: Data Preprocessing -> Complex ML Algorithms and Output -> Local Representations (KLIME, LIME-SUP-R, LIME-SUP-D, etc.), Global Diagnostics (PDP, ICE plot, ALE plot, ATDEV plot, etc.), Structured Models (Explainable Neural Networks) -> Interactive Visualization]
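As a small illustration of the global diagnostics named above, the sketch below uses scikit-learn to draw partial dependence (PDP) and ICE curves. The names `gbm`, `X`, and the two feature columns are hypothetical placeholders, not code from the talk.

```python
# Minimal sketch of PDP and ICE curves with scikit-learn.
# Assumes a fitted estimator `gbm` and a pandas DataFrame `X` that contains
# the (hypothetical) columns "ltv_fcast" and "horizon".
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    gbm, X, features=["ltv_fcast", "horizon"],
    kind="both",   # "average" = PDP only, "individual" = ICE only, "both" = overlay
)
plt.show()
```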

Locally Interpretable Model

Local interpretation aims to interpret the relationship between inputs and output over a local region. The idea is that a simple parametric model can approximate the input-output relationship there, so local variable importance and input-output relationships can be read directly from the simple local model.

Related tools include:
- LIME (Ribeiro et al. 2016)
- KLIME (H2O)
- LIME-SUP (our approach)

LIME

LIME (Local Interpretable Model-Agnostic Explanations) is perhaps the first local interpretation method, proposed in Ribeiro et al. (2016). The idea is to approximate the model around a given instance/observation with a linear model in order to explain the prediction:
- Simulate new instances around the instance of interest.
- Predict on the new instances using the machine learning model f(x).
- Pick a kernel and fit a linear model using the kernel as weights; penalize the complexity of the linear model, for example by fitting a ridge regression.

Available in Python (lime package) and R (lime package). A minimal Python sketch is shown below.
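A minimal sketch of the Python lime package on tabular data, assuming a fitted classifier `model`, a training matrix `X_train`, and a list `feature_names` (all hypothetical names):

```python
# Minimal sketch of the lime package on tabular data (pip install lime).
# `model`, `X_train`, and `feature_names` are assumed to exist already.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train),
    feature_names=feature_names,       # list of column names (assumed)
    mode="classification",
)

# Explain one instance: lime perturbs it, scores the perturbations with
# model.predict_proba, and fits a kernel-weighted ridge model locally.
exp = explainer.explain_instance(
    np.asarray(X_train)[0], model.predict_proba, num_features=5
)
print(exp.as_list())   # (feature, local coefficient) pairs
```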

KLIME

KLIME is a variant of LIME proposed in H2O Driverless AI. It divides the input space into regions and fits a linear model in each region:
- Cluster the input space using the K-means algorithm.
- Fit a linear model to the machine learning prediction f(x) in each cluster.
- The number of clusters is chosen by maximizing R-square.

KLIME can be used as a surrogate model (a less accurate but more interpretable substitute for the machine learning model). However, it has some disadvantages:
- The unsupervised partitioning can be unstable, yielding different partitions for different initial cluster locations.
- The unsupervised partitioning does not incorporate any model information, which seems critical for preserving the underlying model structure, so it is less accurate.
- K-means partitions the input space according to Voronoi diagrams, which is less intuitive in a business environment where modelers are more used to rectangular partitioning (segmentation).

[Figure: KLIME vs. LIME-SUP partitions of f(x) over x, contrasting K-means (Voronoi) clustering with LIME-SUP's rectangular partitioning.]
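For intuition, here is a rough sketch of the KLIME idea (not the H2O Driverless AI implementation): K-means clustering of the inputs followed by a ridge fit to the model prediction f(x) in each cluster. `X` and `fx` are assumed to be the feature matrix and the ML predictions.

```python
# Rough sketch of a KLIME-style surrogate: K-means partition + per-cluster
# linear model fit to the ML prediction f(x). `X` and `fx` are assumed inputs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def klime_surrogate(X, fx, n_clusters=8, random_state=0):
    km = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X)
    models, preds = {}, np.empty_like(fx, dtype=float)
    for k in range(n_clusters):
        idx = km.labels_ == k
        models[k] = Ridge().fit(X[idx], fx[idx])   # local linear model
        preds[idx] = models[k].predict(X[idx])
    return km, models, r2_score(fx, preds)         # overall surrogate R^2
```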

LIME-SUP

LIME-SUP is a local interpretation method developed internally by our team. Similar to KLIME, it partitions the X-space and fits a simple model in each partition. The key difference from KLIME is that the partitioning is supervised, using information from the machine learning model. The goal is to use supervised partitioning to obtain a surrogate model that is more stable, more accurate, and more interpretable than KLIME.

There are two implementations of LIME-SUP. One uses model-based trees (LIME-SUP-R) and the other uses partial derivatives (LIME-SUP-D). The two share similar principles but differ in focus: LIME-SUP-R fits better but is more computationally expensive.

LIME-SUP-R

LIME-SUP-R uses a model-based tree. A model-based tree differs from a traditional classification and regression tree (CART) in that it fits a model, instead of a constant, in each tree node. At each node it works as follows:
- Fit a parent model to the node.
- Split the node into two child nodes and fit separate child models. The best split is the one that maximizes the combined model fit of the two child models.
- Keep splitting until a stopping criterion is met (depth, leaf node size, etc.).

LIME-SUP-R partitions the X-space in a supervised manner by utilizing the machine learning model predictions. At a high level it works as follows:
- Predict on the data using the machine learning model f(x). Predictions can be the predicted mean (continuous response) or log-odds (binary response).
- For a specified form of parametric model (say, linear regression), fit a model-based tree to the predictions using the machine learning model's predictors.
- Prune the tree using appropriate model-fit statistics.
- Check model fit, plot the tree and coefficients.

[Diagram: root model y = α_0 + β_0ᵀx splits on x_1 > 0 vs. x_1 ≤ 0 into child models y = α_1 + β_1ᵀx and y = α_2 + β_2ᵀx.]

A simplified sketch of the split search appears below.
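A simplified sketch of one split search in a model-based tree, assuming the node models are ordinary linear regressions and the response is the ML prediction f(x); the actual LIME-SUP-R algorithm also handles pruning, stopping rules, and other model forms.

```python
# Illustrative split search for a model-based tree node: fit a linear child
# model on each side of a candidate split and keep the split with the best
# combined fit (smallest total SSE). `X` and `fx` are assumed inputs.
import numpy as np
from sklearn.linear_model import LinearRegression

def best_split(X, fx, n_thresholds=10, min_node=30):
    best = (None, None, np.inf)                          # (feature, threshold, SSE)
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, n_thresholds)):
            left, right = X[:, j] <= t, X[:, j] > t
            if left.sum() < min_node or right.sum() < min_node:
                continue
            sse = sum(
                np.sum((fx[m] - LinearRegression().fit(X[m], fx[m]).predict(X[m])) ** 2)
                for m in (left, right)
            )
            if sse < best[2]:
                best = (j, t, sse)
    return best
```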

LIME-SUP-D

LIME-SUP-D uses the partial derivatives ∂f(x)/∂x of the ML model f(x); the derivatives are the coefficients we would obtain by fitting a linear model to f(x) in a neighborhood of x. We can therefore partition the X-space by grouping points whose derivatives are close, since similar derivatives throughout a region indicate that a single linear model fits well there. At a high level it works as follows:
- Compute the partial derivatives ∂f(x)/∂x. The derivatives can be computed using a neural network surrogate model.
- Fit a regression tree to the multi-dimensional partial derivatives using the machine learning predictors.
- Prune the tree using appropriate model-fit statistics.
- Check model fit, plot the tree and coefficients for interpretation.

[Diagram: points (x_i, ∂f(x_i)/∂x) partitioned by the tree split x_1 > 0 vs. x_1 ≤ 0.]

A rough sketch of this workflow is given below.
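A rough sketch of the workflow, using finite differences in place of a neural network surrogate for the derivatives and a scikit-learn regression tree for the partitioning; `model.predict` and `X` are hypothetical names.

```python
# Sketch of LIME-SUP-D's ingredients: approximate the partial derivatives of
# f(x) by finite differences, then fit a multi-output regression tree to them.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def numeric_gradients(predict, X, eps=1e-4):
    grads = np.empty_like(X, dtype=float)
    base = predict(X)
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += eps
        grads[:, j] = (predict(Xp) - base) / eps   # one-sided difference
    return grads

# grads = numeric_gradients(model.predict, X)
# tree = DecisionTreeRegressor(max_depth=3).fit(X, grads)  # group similar derivatives
```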

Example

We use a data set with 50 predictor variables and 1 million observations. The response is a binary default indicator. A gradient boosting model is trained using the 50 variables. The top 20 variables, excluding 4 due to correlation, are then selected to run LIME-SUP and KLIME, so in total there are 16 variables. The log-odds (logits) and the partial derivatives of the logits are computed and used for LIME-SUP-R and LIME-SUP-D, respectively.
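A minimal sketch of this setup, with the variable-selection step omitted and hypothetical names `X_train` and `y_default`:

```python
# Train a gradient boosting model on the binary default indicator and extract
# log-odds for the surrogate fits. Names are illustrative, not the talk's code.
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier().fit(X_train, y_default)
logodds = gbm.decision_function(X_train)   # logits used as the LIME-SUP response
```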

Example

The figure below shows the tree structure and the coefficients in the terminal nodes for LIME-SUP-R with depth = 3. The strongest patterns in the coefficients are for ltv_fcast and horizon. Combining the tree structure and the coefficient values, we can see:
- The coefficient for ltv_fcast peaks for mid-range ltv (node 13) and is low for low ltv (nodes 7, 8, 9, 10) and high ltv (node 14).
- The coefficient for horizon is positive for non-delinquent accounts (dlq_new_clean = 1, nodes 8 & 12) and negative otherwise (nodes 7 & 11); the overall effect of horizon is close to 0. This explains the interaction effects.
- The coefficient for fico is also smaller for high ltv (node 14, red curve), indicating an interaction effect.

Example

Similarly, we fit KLIME with 8 clusters. The table below shows the MSE and R-square for the 5 methods. LIME-SUP is better than KLIME, although the difference is not as striking in this case. Besides that, LIME-SUP-R fits slightly better than LIME-SUP-D, which is expected.

        LIME-SUP-R  LIME-SUP-D  KLIME-E  KLIME-M  KLIME-P
MSE     0.113       0.118       0.137    0.147    0.138
R^2     0.927       0.924       0.911    0.905    0.911
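Fidelity numbers of this kind can be computed by comparing each surrogate's predictions with the GBM log-odds it approximates; a minimal sketch with hypothetical arrays `logodds` and `surrogate_pred`:

```python
# Surrogate fidelity: MSE and R^2 of the surrogate against the GBM log-odds.
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(logodds, surrogate_pred)
r2 = r2_score(logodds, surrogate_pred)
```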

Example

The figure below provides a different view of the comparison: values of MSE and R^2 computed within each of eight local regions. The conclusions are similar to before: LIME-SUP does better in all local regions, except that LIME-SUP-D has a larger MSE than KLIME in KLIME-E regions 2 and 4.

References

- Liu, Chen, Vaughan, Nair, and Sudjianto (2018), "Model Interpretation: A Unified Derivative-based Framework for Nonparametric Regression and Supervised Machine Learning," arXiv:1808.07216.
- Hu, Chen, Nair, and Sudjianto (2018), "Locally Interpretable Models and Effects based on Supervised Partitioning (LIME-SUP)," arXiv:1806.00663.
- Vaughan, Sudjianto, Brahimi, Chen, and Nair (2018), "Explainable Neural Networks based on Additive Index Models," arXiv:1806.01933.