Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio

Adela Ioana Tudor, Adela Bâra, Simona Vasilica Oprea
Department of Economic Informatics and Cybernetics, Academy of Economic Studies, Bucharest, Romania
adela_sw@yahoo.com; bara.adela@ie.ase.ro; simona.oprea@transelectrica.ro

Abstract: Predicting borrowers' ability to repay is the essence of credit risk management. A number of approaches and methodologies have been developed and implemented to date, but research continues to improve existing techniques and to design new models in pursuit of optimal solutions. In this context, we examine different methods of data extraction and analysis in terms of their accuracy, in order to assess their comparative effectiveness in determining the probability of default. The data mining techniques used in this paper are Logistic Regression, the Naive Bayes algorithm and Support Vector Machines. We consider it particularly interesting and useful that the probability of default is calculated for each debtor individually, rather than for groups of customers with specific features, as is the practice in most banks today.

Key words: data mining, classification, regression, support vector machine, attribute importance, business intelligence.

1. Introduction

In lending, there are two major approaches to modelling credit risk at the transaction level: the analytical approach based on accounting data, and statistical prediction. Because every financial business holds a large amount of information about its clients, an effective model can be implemented by using the bank's own database together with data mining techniques. Business Intelligence (BI) solutions provide all the functions necessary to identify, integrate and analyze data, offering decision support at the strategic management level. Data Mining (DM), as part of BI systems, has enjoyed great popularity in recent years, with advances in both research and commercialization. Data mining focuses on assessing the predictive power of models and performs analyses that would be too laborious and time-consuming with traditional statistical methods. Data mining technology can solve business problems in banking and finance by finding attributes, causalities and correlations that are not immediately obvious to managers, because the volume of data is too large or is generated too quickly.

The main idea of the article is to build a model based on data mining techniques that can predict customer behaviour over time. To this end, we performed a comparative analysis of data mining techniques for assessing the probability of default of credit card holders, in order to determine which statistical prediction methods are, in our view, best suited to analyzing this indicator. The methods implemented are Logistic Regression, the Naive Bayes algorithm and Support Vector Machines.

The paper is structured as follows: Section 2 describes the model for data analysis, based on a specific scoring model that we propose. In Section 3, we apply the attribute importance algorithm and improve the initial model in order to make a comparative analysis. Section 4 presents the conclusions.

For our study we combine methods with supervised and unsupervised learning mechanisms. To test the different methods we used Oracle Data Mining (ODM), which is organized around several generic operations, providing a unified interface for extraction and discovery functions. These operations include functions for constructing, deploying, testing and manipulating data to create models. ODM implements a series of algorithms for classification, prediction, regression, clustering, association, attribute selection and data analysis. Oracle Data Miner provides options for each stage: transforming the data and building the models (build), testing the results (test), and evaluating and applying the models on new data sets (apply).
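To make this build/test/apply cycle concrete, the sketch below reproduces the same three stages with scikit-learn as a stand-in for ODM. It is only an illustration under our own assumptions: the file names, the target column RESTANTIER (introduced in Section 2) and the choice of classifier are ours, not part of the original ODM workflow.

```python
# A minimal sketch of the build/test/apply workflow, with scikit-learn
# standing in for Oracle Data Mining; file names and columns are assumed,
# and the attributes are assumed to be already numerically encoded.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

build = pd.read_csv("build_set.csv")  # table used for model construction
test = pd.read_csv("test_set.csv")    # table used for testing/validation

X_build, y_build = build.drop(columns="RESTANTIER"), build["RESTANTIER"]
X_test, y_test = test.drop(columns="RESTANTIER"), test["RESTANTIER"]

model = LogisticRegression(max_iter=1000).fit(X_build, y_build)    # build
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))  # test

new_clients = pd.read_csv("new_clients.csv")
default_probability = model.predict_proba(new_clients)[:, 1]       # apply
```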

2. The model for data analysis

Classification means grouping observations on the basis of a predictable attribute. Each case contains a set of attributes, one of which is the classification attribute (the predictable attribute); in our application, this attribute is called RESTANTIER. The operation consists in finding a model that describes the predictable attribute as a function of the other attributes taken as input values. Basically, the objective of classification is first to analyze the training data and develop a model for each class using the attributes available in the data. These class descriptions are then used to classify future independent test data, or to develop a better description for each class. In order to train a classification model, the class value of each case in the data set must be known; such values are usually found in historical data [7].

We review the most popular classification algorithms, used in our study: logistic regression (LR), Bayesian networks (BN) and support vector machines (SVM).

A Logistic Regression model is a special case of the linear regression models: it specifies that an appropriate function of the fitted probability of the event is a linear function of the observed values of the available explanatory variables. The major advantage of this approach is that it can produce a simple probabilistic classification formula [1]. In logistic regression there is no direct counterpart of the coefficient of determination (R²) frequently used in the general linear model; R² has the desirable interpretation of the proportion of variation in the dependent variable that can be explained by the predictor variables of a given regression model [6].
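For concreteness, the "appropriate function" in question is the logit; in standard notation (a textbook formula, not reproduced in the original paper), the model is

\[ \operatorname{logit}(p_i) = \ln\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}, \qquad p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})}}, \]

where \(p_i\) is the fitted probability that debtor \(i\) defaults (RESTANTIER = 1) and \(x_{i1}, \ldots, x_{ik}\) are the observed values of the explanatory attributes.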
Bayesian networks represent the dependencies between variables in a graphical manner. Based on the network structure, various kinds of inference can be performed. There are two key optimization-related issues when using Bayesian networks. First, when some of the nodes in the network are not observable, that is, there is no data for the values of the attributes corresponding to those nodes, finding the most likely values of the conditional probabilities can be formulated as a non-linear mathematical program; in practice this is usually solved using a simple steepest descent approach. The second optimization problem occurs when the structure of the network is unknown, which can be formulated as a combinatorial optimization problem [5].

Support Vector Machines have recently been implemented in many financial applications, mainly in the area of time series prediction and classification. The Support Vector Machine is a classification and regression prediction tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fitting to the data [2]. The algorithm involves the following steps:

a) Class separation: we must find the optimal separating hyperplane between the two classes. Linear programming can be used to obtain both linear and non-linear discrimination models between separable data points or instances. If such a hyperplane exists, then there are many such planes; the optimal separating hyperplane is the one that maximizes the sum of the distances from the plane to the closest positive example and to the closest negative example.

b) Overlapping classes: data points on the wrong side of the discriminating margin are weighted down.

c) Nonlinearity: when we cannot find a linear separator, data points are projected into a higher-dimensional space where they effectively become linearly separable (this projection is realised via kernel techniques).

d) Problem solution: the whole task can be formulated as a quadratic optimization problem, which can be solved by specific techniques.
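For reference, the quadratic optimization problem mentioned in step d) is, in its standard soft-margin form (a textbook formulation, not taken from the paper),

\[ \min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\left(w^{\top}\phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \]

where \(\phi\) is the kernel-induced projection from step c), the slack variables \(\xi_i\) implement the down-weighting of points on the wrong side of the margin from step b), and \(C\) controls the trade-off between margin width and training errors.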

2.1. Data

The study is based on financial data from 2009 from an important bank in Romania, and the target customers are credit card holders. Among the 18,239 instances, 1,489 record arrears. The research involves normalization of the attribute values and a binary response variable: 1 for the default situation and 0 for non-default. The explanatory variables, or attributes, come from the scoring model we propose, which includes: Credit amount (the limit up to which the debtor may make multiple withdrawals and repayments from the credit card), Credit balance, Opening date of the credit account, Credit product identifier, Category, Currency, Client's name, Gender, Age, Marital status, Profession, Client with history (including other banking products and payment history), Deposit state, Amounts of deposits opened in the bank, if applicable, and Scoring rate (scoring points from 1 to 6, where 1 is for the best and 6 for the weakest debtor). The data was divided into two tables, one used for model construction and one for testing and validating the model.

2.2. Applying the algorithms on all attributes

Oracle Data Mining enables the application of the algorithms to the recorded observations and offers the possibility to configure the models. For interpreting the results, a series of suggestive views is made available. Thus, based on each client's attributes found in the table above, we determine the customer's profile in terms of solvency (good payer, bad payer).

Let us interpret the results of the cost matrix and confusion matrix obtained from the construction and testing of each model. These are generated by the Accuracy module. Accuracy refers to the percentage of correct predictions made by the model compared with the real classifications in the test data. In the following figure, the top screen shows the cost matrix and the second screen shows the confusion matrix.

Figure 1 - Cost and confusion matrices for the LG model

As we can see, for the LG model there are 16,750 instances with value 0 (debtors who repay), of which 98.15% were predicted correctly, and 1,489 default cases with value 1, of which the model predicted 93.22% correctly. The total cost is 411 records. The confusion matrix displays the model's errors as follows: the value 101 in the bottom-left cell counts the false negative predictions, i.e. predictions of 0 when the real value is 1 (the client is predicted to be a good payer but actually records default), and the value 310 in the top-right cell counts the false positive predictions, i.e. predictions of 1 when the actual value is 0 (the customer is predicted to be a bad payer but actually repays the loan). The LG model made 17,828 (16,440 + 1,388) correct predictions and 411 (101 + 310) wrong predictions out of a total of 18,239 instances. The error rate is 0.022, the false negative rate is 0.067 and the false positive rate is 0.018.

In the case of the Naive Bayes model, there are 16,750 instances with value 0 (good payers), of which 95.85% were predicted correctly, and 1,489 default cases with value 1, of which the model predicted 99.87% correctly. The total cost is 781.28. The confusion matrix displays the model's errors as follows: the NB model made 17,542 (16,055 + 1,487) correct predictions and 697 (695 + 2) wrong predictions out of a total of 18,239 instances. The error rate is 0.038, the false negative rate is 0.001 and the false positive rate is 0.041.

In the case of Support Vector Machines, there are 16,750 instances with value 0, of which 97.37% were predicted correctly, and 1,489 default cases with value 1, of which the model predicted 98.52% correctly. The total cost is 462 u.m. The confusion matrix displays the model's errors as follows: the SVM model made 17,777 (16,310 + 1,467) correct predictions and 462 (440 + 22) wrong predictions out of a total of 18,239 instances. The error rate is 0.025, the false negative rate is 0.014 and the false positive rate is 0.026.

In conclusion, we can say that in terms of cost and predictions, logistic regression showed both the lowest cost and the lowest error rate, followed at a very small distance by the Support Vector Machines, the error rates being almost similar.
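The rates quoted above follow directly from the confusion matrix counts. The sketch below, a verification aid of our own rather than ODM output, derives them from the LG model's published figures:

```python
# Sketch: deriving the reported rates from the LG confusion matrix counts.
import numpy as np

# rows = actual class (0 = repays, 1 = defaults), columns = predicted class
cm = np.array([[16440,  310],   # actual 0: 310 false positives
               [  101, 1388]])  # actual 1: 101 false negatives

(tn, fp), (fn, tp) = cm
total = cm.sum()                 # 18,239 instances
error_rate = (fp + fn) / total   # (310 + 101) / 18,239
fnr = fn / (fn + tp)             # 101 / 1,489
fpr = fp / (fp + tn)             # 310 / 16,750
print(error_rate, fnr, fpr)      # ~0.0225, ~0.0678, ~0.0185
```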

3. Improving the model for default analysis

3.1. The attribute importance algorithm

Oracle Data Mining offers a specific feature named Attribute Importance (AI), based on the Minimum Description Length (MDL) algorithm, which ranks the attributes by their significance in making predictions. Any system, method or program that uses AI to reduce the time and computing resources needed to build data mining models goes through a process of selecting a subset of the original attributes: redundant, irrelevant or uninformative attributes are eliminated, and those that can be extremely useful in making predictions are identified. The main idea behind the MDL principle is that any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than the number of symbols needed to describe the data literally [3].

In this study, the objective is to build a predictive model that identifies the borrowers with a high probability of default and draws a clear distinction between the bank's valuable customers and those less beneficial to the credit system. The first question we ask is: what features should a good payer have, and what represents such a customer best? In order to receive a well-founded answer, we apply the MDL algorithm to the following attributes:

Figure 2 - Attributes on which the MDL algorithm is applied

The result is summarized in the chart below; the accompanying table lists the attributes in order of their importance and significance:

Figure 3 - The results after applying the AI algorithm
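ODM's MDL-based ranking is proprietary to the tool, but the attribute-ranking idea itself can be sketched with an information-theoretic proxy. The snippet below is our own illustration, not the paper's procedure: it ranks attributes by their mutual information with the RESTANTIER target, with file and column names assumed.

```python
# Sketch: ranking attributes by relevance to the target, as a stand-in for
# ODM's MDL-based Attribute Importance; data set and columns are assumed,
# and the attributes are assumed to be already numerically encoded.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("build_set.csv")
X = data.drop(columns="RESTANTIER")
y = data["RESTANTIER"]

# Mutual information scores each attribute; higher means more informative.
scores = mutual_info_classif(X, y, discrete_features="auto", random_state=0)
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking)  # attributes listed in order of importance
```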

It is worth mentioning that the scoring points and the amount of liquidity held by the client in deposits (if any) are extremely important for credit decision making. We also believe that this information is very useful both on its own and for improving the predictive models built, especially Naive Bayes and Support Vector Machines.

3.2. Applying the algorithms after attribute selection

After selecting the most important attributes, we apply the data mining algorithms again, with the results interpreted as follows.

For the LG model, 98.88% of the 16,750 instances with value 0 (good payers) were predicted correctly, as were 90.8% of the 1,489 default cases corresponding to value 1. The total cost is 324 u.m. The confusion matrix displays the following errors: 137 false negative predictions and 187 false positive predictions. The LG model made 17,915 (16,563 + 1,352) correct predictions and 324 (137 + 187) wrong predictions out of a total of 18,239 instances. The error rate is 0.017.

In the case of NB, 95.81% of the 16,750 instances with value 0 were predicted correctly, as were 99.93% of the 1,489 default cases with value 1. The total cost is 776.65 u.m. The confusion matrix displays the following errors: one false negative prediction and 702 false positive predictions. The NB model made 17,536 (16,048 + 1,488) correct predictions and 703 (702 + 1) wrong predictions out of a total of 18,239 instances. The error rate is 0.038.

In the case of SVM, 97.71% of the 16,750 instances with value 0 were predicted correctly, as were 98.52% of the 1,489 default cases with value 1. The total cost is 406 u.m. The confusion matrix displays the following errors: 22 false negative predictions and 384 false positive predictions. The SVM model made 17,833 (16,366 + 1,467) correct predictions and 406 (22 + 384) wrong predictions out of a total of 18,239 instances. The error rate is 0.022.

3.3. Comparative analysis of the models after applying the Attribute Importance algorithm

After applying the three models to the set of instances, we obtained a series of predictions for credit reimbursement. We centralized these results in a virtual table that contains data about each customer (name, account, ID number), the real value of the RESTANTIER attribute, the predictions made by the three models, and the likelihood attached to these predictions (see Fig. 4).

Figure 4 - Comparative results of the predictions
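A virtual table of this kind can be assembled directly from the fitted models. The sketch below is our own reconstruction of the structure of Figure 4, assuming three already-fitted scikit-learn classifiers and the customer columns named in the text:

```python
# Sketch: centralizing each model's prediction and its likelihood into one
# table, mirroring Figure 4; models and column names are assumptions.
import pandas as pd

def build_comparison(models, X, y, customers):
    """models: dict of name -> fitted binary classifier (classes 0/1)
    exposing predict_proba (for sklearn's SVC: probability=True)."""
    table = customers[["NAME", "ACCOUNT", "ID_NUMBER"]].copy()
    table["RESTANTIER"] = y.values  # real value of the target attribute
    for name, model in models.items():
        proba = model.predict_proba(X)
        table[f"{name}_PREDICTION"] = proba.argmax(axis=1)  # 0 or 1
        table[f"{name}_LIKELIHOOD"] = proba.max(axis=1)     # its probability
    return table

# Hypothetical usage, with lg, nb, svm fitted as in Section 2:
# fig4 = build_comparison({"LG": lg, "NB": nb, "SVM": svm},
#                         X_test, y_test, customers)
```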

The comparative results show that only 845 of the 18,239 records are incorrect predictions, representing 4.63% of the total. We also compared the incorrect predictions produced by the three models. As noted before, SVM registered 406 incorrect predictions (2.22%), NB recorded 703 incorrect predictions (3.85%), and LG registered only 324 wrong predictions (1.77%). From the cost point of view, LG recorded the lowest cost, followed by SVM, with NB trailing by a considerable margin. In conclusion, on our observations, logistic regression showed the best results, but the other two models also registered small errors (less than 5%). We consider that all three models can be successfully applied in banking practice.

4. Conclusions

Studies show that recent years have seen a dramatic explosion in the level of interest in data mining, as users seek to take advantage of the tools offered by this technology in order to gain an intelligent competitive advantage. The management of financial risks has many dimensions and involves many types of decisions. The importance of this article comes from the complex issue of credit risk management as a means of ensuring financial stability. Credit risk is considered the most dangerous category of banking risk, and in order to prevent it, banks must meet a series of regulations and use different predictive models. Credit default is therefore a major research area, owing to its effects.

In this research, various attribute selection and data mining techniques have been used to build several predictive models for banking practice. It has been found that Logistic Regression performs very well in predicting defaulters, followed closely by Support Vector Machines. Further work is in progress to apply the results to the effects generated by the current international crisis. This paper also presents some results of the research project PN II, TE Program, Code 332: "Informatics Solutions for decision making support in the uncertain and unpredictable environments in order to integrate them within a Grid network", financed within the framework of the People research program.

References

[1] Cheng I-Yeh, Lien Che-hui, The comparisons of data mining techniques for the predictive accuracy of probability of default, Expert Systems with Applications, 2009.
[2] Oracle Data Mining Concepts, http://docs.oracle.com/cd/b28359_01/datamine.111/b28129.pdf.
[3] Grünwald, P., The Minimum Description Length Principle, 1998.
[4] Hamerle, A., Knapp, M. and Wildenauer, N., Default and recovery correlations, 2007.
[5] Olafsson, S., Li, X., Wu, S., Operations research and data mining, European Journal of Operational Research, 187, pp. 1429-1448, 2008.
[6] Ţiţan, E., Tudor, A., Conceptual and statistical issues regarding the probability of default and modelling default risk, Database Systems Journal, Vol. II, No. 1/2011, pp. 13-22.
[7] Tudor, A., Bâra, A., Botha, I., Solutions for analyzing CRM systems - data mining algorithms, International Journal of Computers, Issue 4, Volume 5, 2011, pp. 485-493, ISSN: 1998-4308.

Acknowledgements

This article is a result of the project POSDRU/88/1.5./S/55287 "Doctoral Programme in Economics at European Knowledge Standards (DOESEC)". This project is co-funded by the European Social Fund through the Sectorial Operational Programme for Human Resources Development 2007-2013, coordinated by the Bucharest Academy of Economic Studies in partnership with the West University of Timisoara.