# Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio

Save this PDF as:

Size: px
Start display at page:

Download "Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio"

## Transcription

2 Classification means grouping the observations based on a predictable attribute. Each case contains a set of attributes, of which one is the classification attribute (the predictable attribute). In our application, this attribute is called RESTANTIER. The operation consists in finding a model that describes the predictable attribute as a function of other attributes taken as input values. Basically, the objective of the classification is to first analyze the training data and develop a model for each class using the attributes available in the data. Such class descriptions are then used to classify future independent test data or to develop a better description for each class. In order to train a classification model, the class values of each case in the data set must be known, values that are usually found in historical data [7]. We review the most popular classification algorithms used in our study: logistic regression (LR), Bayesian networks (BN) and support vector machines (SVM). A Logistic Regression model is a special case of linear regression models and it specifies that an appropriate function of the fitted probability of the event is a linear function of the observed values of the available explanatory variables. The major advantage of this approach is that it can produce a simple probabilistic formula of classification. [1] In logistic regression, there is no definition of the coefficient of determination (R 2 ) which is frequently used in the general linear model. R 2 has the desired interpretability as the proportion of variation of the dependent variable, which can be explained by the predictor variables of a given regression model. [6] Bayesian networks show in a graphical manner the dependencies between variables. Based on the network structure, it can realize various kinds of inferences. There are two key optimization-related issues when using Bayesian networks. First, when some of the nodes in the network are not observable, that is, there is no data for the values of the attributes corresponding to those nodes, finding the most likely values of the conditional probabilities can be formulated as a non-linear mathematical program. In practice this is usually solved using a simple steepest descent approach. The second optimization problem occurs when the structure of the network is unknown, which can be formulated as a combinatorial optimization problem [5]. Support Vector Machines have been implemented in many financial applications recently, mainly in the area of time series prediction and classification. Support Vector Machine is a classification and regression prediction tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fit to the data. [2] The algorithm is considering the following steps: a) Class separation: we must find the optimal separating hyper plane between the two classes. Linear programming can be used to obtain both linear and non-linear discrimination models between separable data points or instances. If this hyper plane exists, then there are many such planes. The optimal separating hyper plane is the one that maximizes the sum of the distances from the plane to the closest positive example and the closest negative example. b) Overlapping classes: data points on the wrong side of the discriminated margin are weighted down; c) Nonlinearity: when we cannot find a linear separator, data points are projected into a higherdimensional space where the data points effectively become linearly separable (this projection is realised via kernel techniques); d) Problem solution: the whole task can be formulated as a quadratic optimization problem which can be solved by specific techniques Data The study is based on financial data in 2009 from an important bank in Romania and the target customers are credit card holders. Among the instances, 1489 record arrears. The research involves normalization of attribute values and a binary variable as response variable: 1 for the default situation and 0 for non-default. Explanatory variables or the attributes are found in the scoring model proposed by us that includes: Credit amount (the amount limit by which the debtor may have multiple withdrawals and repayments from the credit card), Credit balance, Opening date of the credit account, credit product identification, Category, Currency, Client s name, Gender, Age, Marital status, Profession, Client with history (including other banking products and payment history), Deposit state, Amounts of deposits opened in the bank, if applicable, Scoring rate (scoring points from 1 to 6, 1 is for the best and 6 for the weakest debtor). Data was divided into two tables, one used for model construction and one for testing and validating the model Applying the algorithm on all attributes Oracle Data Mining enables the application of the algorithm on the recorded observations and the ISBN:

3 possibility to configure the models. For results interpretation, there are made available a series of suggestive views. Thus, based on each client attributes found in the table above, we determine the customer profile in terms of solvency (good payer, bad payer). Let s interpret the results on cost matrix and confusion matrix obtained from the construction and testing of each model. These are generated from the Accuracy module. Accuracy refers to the percentage of correct predictions made by the model compared to the real classifications from the test data. In the following figure, the top screen shows the cost matrix and the second screen shows the confusion matrix. As we can see, for LG model, there are instances with 0 value (debtors who repay) from which 98,15% were correct predicted, and default cases, with 1 value, from which the model predicted correctly 93,22%. The total cost is 411 records. The confusion matrix displays the model errors, as follows: the figure 1 from left cell bottom shows false negative predictions, predictions of 0 when the real value is 1 (the client is a good payer but he records default) and 310 from the right cell above indicates false positive predictions, predictions of 1 when the actual value is 0 (the customer is a bad payer but he repays the loan). LG model made ( ) correct predictions and 411 ( ) wrong predictions from a total of instances. The error rate is 0,022. The false positive rate is 0,067 and the false negative rate is 0,018. In case of Naive Bayes model, there are instances with 0 value (good payers) from which 95,85% were correct predicted, and default cases, with 1 value, from which the model predicted correctly 99,87%. The total cost is 781,28. The confusion matrix displays the model errors, as follows: NB model made ( ) correct predictions and 697 ( ) wrong Figure 1 - Cost confusion matrix for LG model predictions from a total of instances. The error rate is 0,038. The false positive rate is 0,001 and the false negative rate is 0,041. In case of Support Vector Machines, there are instances with 0 value from which 97,37% were correct predicted, and default cases, with 1 value, from which the model predicted correctly 98,52%. The total cost is 462 u.m. The confusion matrix displays the model errors, as follows: SVM model made ( ) correct predictions and 462 ( ) wrong predictions from a total of instances. The error rate is 0,025. The false positive rate is 0,014 and the false negative rate is 0,026. In conclusion, we can say that in terms of cost and predictions, logistic regression showed both the lowest cost and the lowest error rate followed by a very small distance of Support Vector Machines, the error rates being almost similar. 3. Improving the model for default analysis 3.1. Attribute importance algorithm Oracle Data Mining owns a specific characteristic named Algorithm Importance (AI) which is based on Minimum Description Length (MDL) algorithm that orders the attributes by significance in making predictions. Any system, method, or program using ISBN:

4 AI to reduce time and computing resources necessary to build data mining models, goes through a process of selecting a subset of original attributes by eliminating redundant, irrelevant or informal ones and identify those attributes that can be extremely useful in making predictions. The main idea behind the MDL principle is that any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than the number of symbols needed to describe the data literally. [3]. In this study, the objective is to build a predictive model in order to identify those borrowers with a high probability of default and to get a clear distinction between valuable customers of the bank and the less beneficial for the credit system. The first question we ask is: "what features should have a good payer customer? What represents him the best? ". In order to receive a smart response, we apply the MDL algorithm on the following attributes: Figure 2 - Attributes on which the MDL algorithm is applied The result is summarized in the chart below and the table includes the attributes listed in order of their importance and significance: Figure 3 - The results after applying AI algorithm ISBN:

5 It is good to mention that the scoring points and the amount of liquidity held by the client in deposits (if any) are extremely important for credit decision making. We also believe that this information is very useful both individually and also for improving the predictive models built, especially Naive Bayes and Support Vector Machines Applying the algorithm after the attributes selection After the selection of the most important attributes, we will continue applying data mining algorithms, the results being interpreted as follows: For LG model, there were predicted correctly 98,88% out of instances with 0 value (good payers) and 90,8% out of default cases, afferent to 1 value. The total cost is 324 u.m. The confusion matrix displays the following errors: there are 137 false negative predictions, and 187 false positive predictions. LG model made ( ) correct predictions and 324 ( ) wrong predictions out of a total of instances. The error rate is 0,017. In case of NB, there were predicted correctly 95,81% out of instances with 0 value and 99,93% out of default cases, with 1 value. The total cost is 776,65 u.m. The confusion matrix displays the following errors: one false negative prediction and 702 false positive predictions. NB model made ( ) correct predictions and 703 ( ) wrong predictions out of a total of instances. The error rate is 0,038. In case of SVM, there were predicted correctly 97,71% out of instances with 0 value and 98,52% out of default cases, with 1 value. The total cost is 406 u.m. The confusion matrix displays the following errors: there are 22 false negative predictions, and 384 false positive predictions. SVM model made ( ) correct predictions and 406 ( ) wrong predictions out of a total of instances. The error rate is 0, Comparative analysis of the models after applying Attribute Importance algorithm After applying the three models on the set of instances, we obtained a series of predictions for credits reimbursement. We have centralized these results in a virtual table that contains data about each customer (name, account, ID number), the real value of the RESTANŢIER attribute, predictions made by the three models and the likelihood of achieving these predictions (see fig. 4). Figure 4 - Comparative results of predictions ISBN:

### Latest Trends in Information Technology

Solutions for CRM systems integration in organizational systems Lect. Phd. ADELA BARA Lect. Assist ALEXANDRA FLOREA Lect. Assist IULIANA BOTHA SIMONA OPREA Computer Science Department Academy of Economic

### Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

### Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

### Data Mining Technology Based on Bayesian Network Structure Applied in Learning

, pp.67-71 http://dx.doi.org/10.14257/astl.2016.137.12 Data Mining Technology Based on Bayesian Network Structure Applied in Learning Chunhua Wang, Dong Han College of Information Engineering, Huanghuai

### DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS

DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS 1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes and a class attribute

### CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

### Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

### Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

### Supervised vs unsupervised clustering

Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

### Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

### Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

### Clustering Analysis based on Data Mining Applications Xuedong Fan

Applied Mechanics and Materials Online: 203-02-3 ISSN: 662-7482, Vols. 303-306, pp 026-029 doi:0.4028/www.scientific.net/amm.303-306.026 203 Trans Tech Publications, Switzerland Clustering Analysis based

### Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

Data Mining: Classifier Evaluation CSCI-B490 Seminar in Computer Science (Data Mining) Predictor Evaluation 1. Question: how good is our algorithm? how will we estimate its performance? 2. Question: what

### Data mining techniques for actuaries: an overview

Data mining techniques for actuaries: an overview Emiliano A. Valdez joint work with Banghee So and Guojun Gan University of Connecticut Advances in Predictive Analytics (APA) Conference University of

### Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree

World Applied Sciences Journal 21 (8): 1207-1212, 2013 ISSN 1818-4952 IDOSI Publications, 2013 DOI: 10.5829/idosi.wasj.2013.21.8.2913 Decision Making Procedure: Applications of IBM SPSS Cluster Analysis

### Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

### Penalizied Logistic Regression for Classification

Penalizied Logistic Regression for Classification Gennady G. Pekhimenko Department of Computer Science University of Toronto Toronto, ON M5S3L1 pgen@cs.toronto.edu Abstract Investigation for using different

### ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

INSIGHTS@SAS: ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA AGENDA 09.00 09.15 Intro 09.15 10.30 Analytics using SAS Enterprise Guide Ellen Lokollo 10.45 12.00 Advanced Analytics using SAS

### Data Mining with R Programming Language for Optimizing Credit Scoring in Commercial Bank

INTERNATIONAL BLACK SEA UNIVERSITY FACULTY OF COMPUTER TECHNOLOGIES AND ENGINEERING Ph.D. PROGRAM Data Mining with R Programming Language for Optimizing Credit Scoring in Commercial Bank Dilmurodzhon Zakirov

### Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

### A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

### CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

### Web Data mining-a Research area in Web usage mining

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

### Part I: Data Mining Foundations

Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

### Study on Classifiers using Genetic Algorithm and Class based Rules Generation

2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

### Spam Classification Documentation

Spam Classification Documentation What is SPAM? Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient. Objective:

### Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.

Volume 6, Issue 5, May 016 ISSN: 77 18X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Fuzzy Logic in Online

### Study on the Application Analysis and Future Development of Data Mining Technology

Study on the Application Analysis and Future Development of Data Mining Technology Ge ZHU 1, Feng LIN 2,* 1 Department of Information Science and Technology, Heilongjiang University, Harbin 150080, China

### CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr.

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr. Michael Nechyba 1. Abstract The objective of this project is to apply well known

### CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Fall, 2015!1 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function

### Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

### I211: Information infrastructure II

Data Mining: Classifier Evaluation I211: Information infrastructure II 3-nearest neighbor labeled data find class labels for the 4 data points 1 0 0 6 0 0 0 5 17 1.7 1 1 4 1 7.1 1 1 1 0.4 1 2 1 3.0 0 0.1

### Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

### Support Vector Machines + Classification for IR

Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines

### Classification Algorithms in Data Mining

August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

### Data Mining Concepts

Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential

### Jarek Szlichta

Jarek Szlichta http://data.science.uoit.ca/ Approximate terminology, though there is some overlap: Data(base) operations Executing specific operations or queries over data Data mining Looking for patterns

### A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

### Semi-supervised Learning

Semi-supervised Learning Piyush Rai CS5350/6350: Machine Learning November 8, 2011 Semi-supervised Learning Supervised Learning models require labeled data Learning a reliable model usually requires plenty

### Didacticiel - Études de cas

Subject In some circumstances, the goal of the supervised learning is not to classify examples but rather to organize them in order to point up the most interesting individuals. For instance, in the direct

### Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

### CISC 4631 Data Mining

CISC 4631 Data Mining Lecture 03: Introduction to classification Linear classifier Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside) 1 Classification:

### Function Algorithms: Linear Regression, Logistic Regression

CS 4510/9010: Applied Machine Learning 1 Function Algorithms: Linear Regression, Logistic Regression Paula Matuszek Fall, 2016 Some of these slides originated from Andrew Moore Tutorials, at http://www.cs.cmu.edu/~awm/tutorials.html

### Weka ( )

Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

### Louis Fourrier Fabien Gaie Thomas Rolf

CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

### A Random Forest based Learning Framework for Tourism Demand Forecasting with Search Queries

University of Massachusetts Amherst ScholarWorks@UMass Amherst Tourism Travel and Research Association: Advancing Tourism Research Globally 2016 ttra International Conference A Random Forest based Learning

### Machine Learning: Algorithms and Applications Mockup Examination

Machine Learning: Algorithms and Applications Mockup Examination 14 May 2012 FIRST NAME STUDENT NUMBER LAST NAME SIGNATURE Instructions for students Write First Name, Last Name, Student Number and Signature

### Bayes Net Learning. EECS 474 Fall 2016

Bayes Net Learning EECS 474 Fall 2016 Homework Remaining Homework #3 assigned Homework #4 will be about semi-supervised learning and expectation-maximization Homeworks #3-#4: the how of Graphical Models

### Best Customer Services among the E-Commerce Websites A Predictive Analysis

www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 5 Issues 6 June 2016, Page No. 17088-17095 Best Customer Services among the E-Commerce Websites A Predictive

### Data Science Bootcamp Curriculum. NYC Data Science Academy

Data Science Bootcamp Curriculum NYC Data Science Academy 100+ hours free, self-paced online course. Access to part-time in-person courses hosted at NYC campus Machine Learning with R and Python Foundations

### A Taxonomy of Semi-Supervised Learning Algorithms

A Taxonomy of Semi-Supervised Learning Algorithms Olivier Chapelle Max Planck Institute for Biological Cybernetics December 2005 Outline 1 Introduction 2 Generative models 3 Low density separation 4 Graph

### CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Data Warehouse and Data Mining Li Xiong Department of Mathematics and Computer Science Emory University 1 1960s: Evolution of Database Technology Data collection, database creation,

### CS570: Introduction to Data Mining

CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

### Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Lecture No. 03 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

### Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

### Machine Learning for. Artem Lind & Aleskandr Tkachenko

Machine Learning for Object Recognition Artem Lind & Aleskandr Tkachenko Outline Problem overview Classification demo Examples of learning algorithms Probabilistic modeling Bayes classifier Maximum margin

### Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

### All lecture slides will be available at CSC2515_Winter15.html

CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 9: Support Vector Machines All lecture slides will be available at http://www.cs.toronto.edu/~urtasun/courses/csc2515/ CSC2515_Winter15.html Many

### Predicting Popular Xbox games based on Search Queries of Users

1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

### We are IntechOpen, the first native scientific publisher of Open Access books. International authors and editors. Our authors are among the TOP 1%

We are IntechOpen, the first native scientific publisher of Open Access books 3,350 108,000 1.7 M Open access books available International authors and editors Downloads Our authors are among the 151 Countries

### Introduction to Support Vector Machines

Introduction to Support Vector Machines CS 536: Machine Learning Littman (Wu, TA) Administration Slides borrowed from Martin Law (from the web). 1 Outline History of support vector machines (SVM) Two classes,

### Pattern Recognition for Neuroimaging Data

Pattern Recognition for Neuroimaging Data Edinburgh, SPM course April 2013 C. Phillips, Cyclotron Research Centre, ULg, Belgium http://www.cyclotron.ulg.ac.be Overview Introduction Univariate & multivariate

### Data analysis case study using R for readily available data set using any one machine learning Algorithm

Assignment-4 Data analysis case study using R for readily available data set using any one machine learning Algorithm Broadly, there are 3 types of Machine Learning Algorithms.. 1. Supervised Learning

### Text Classification for Spam Using Naïve Bayesian Classifier

Text Classification for E-mail Spam Using Naïve Bayesian Classifier Priyanka Sao 1, Shilpi Chaubey 2, Sonali Katailiha 3 1,2,3 Assistant ProfessorCSE Dept, Columbia Institute of Engg&Tech, Columbia Institute

### Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz

Gradient Descent Wed Sept 20th, 2017 James McInenrey Adapted from slides by Francisco J. R. Ruiz Housekeeping A few clarifications of and adjustments to the course schedule: No more breaks at the midpoint

### What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

Data Mining Andrew Kusiak Intelligent Systems Laboratory 2139 Seamans Center The University it of Iowa Iowa City, IA 52242-1527 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Tel. 319-335

### Building Classifiers using Bayesian Networks

Building Classifiers using Bayesian Networks Nir Friedman and Moises Goldszmidt 1997 Presented by Brian Collins and Lukas Seitlinger Paper Summary The Naive Bayes classifier has reasonable performance

### Introduction to Data Mining

Introduction to JULY 2011 Afsaneh Yazdani What motivated? Wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge What motivated? Data

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 1 1 Acknowledgement Several Slides in this presentation are taken from course slides provided by Han and Kimber (Data Mining Concepts and Techniques) and Tan,

Advanced Data Mining Techniques David L. Olson Dursun Delen Advanced Data Mining Techniques Dr. David L. Olson Department of Management Science University of Nebraska Lincoln, NE 68588-0491 USA dolson3@unl.edu

### Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

### Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

### Review on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

### Evaluating Classifiers

Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

### Data Science Tutorial

Eliezer Kanal Technical Manager, CERT Daniel DeCapria Data Scientist, ETC Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 2017 SEI SEI Data Science in in Cybersecurity Symposium

### EPL451: Data Mining on the Web Lab 5

EPL451: Data Mining on the Web Lab 5 Παύλος Αντωνίου Γραφείο: B109, ΘΕΕ01 University of Cyprus Department of Computer Science Predictive modeling techniques IBM reported in June 2012 that 90% of data available

### Encoding Words into String Vectors for Word Categorization

Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

### .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label.

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Definitions Data. Consider a set A = {A 1,...,A n } of attributes, and an additional

### Face Detection using Hierarchical SVM

Face Detection using Hierarchical SVM ECE 795 Pattern Recognition Christos Kyrkou Fall Semester 2010 1. Introduction Face detection in video is the process of detecting and classifying small images extracted

### Multimedia Content Delivery in Mobile Cloud: Quality of Experience model

Multimedia Content Delivery in Mobile Cloud: Quality of Experience model Aleksandar Karadimce Danco Davcev Faculty of Computer Science and Engineering University Ss Cyril and Methodius, Skopje, R. Macedonia

### PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 15-1: Support Vector Machines Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

### Classification: Linear Discriminant Functions

Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions

### Why Do Nearest-Neighbour Algorithms Do So Well?

References Why Do Nearest-Neighbour Algorithms Do So Well? Brian D. Ripley Professor of Applied Statistics University of Oxford Ripley, B. D. (1996) Pattern Recognition and Neural Networks. CUP. ISBN 0-521-48086-7.

### Classification. 1 o Semestre 2007/2008

Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

### A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

International Journal of Innovation and Scientific Research ISSN 2351-8014 Vol. 12 No. 1 Nov. 2014, pp. 217-222 2014 Innovative Space of Scientific Research Journals http://www.ijisr.issr-journals.org/

### Neural Network in Data Mining

Neural Network in Data Mining Karan S Nayak Dept. of Electronics & Telecommunicaion, Dwarkadas J. Sanghvi College of Engg., Mumbai, India Abstract Data Mining means mine data from huge amount of data.

### Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Data Mining Piotr Paszek piotr.paszek@us.edu.pl Introduction (Piotr Paszek) Data Mining DM KDD 1 / 44 Plan of the lecture 1 Data Mining (DM) 2 Knowledge Discovery in Databases (KDD) 3 CRISP-DM 4 DM software

### Binary Association Rule Mining Using Bayesian Network

2011 International Conference on Information and Network Technology IPCSIT vol.4 (2011) (2011) IACSIT Press, Singapore Binary Association Rule Mining Using Bayesian Network Venkateswara Rao Vedula 1 and

### Introduction to Machine Learning. Xiaojin Zhu

Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006

### Iteration Reduction K Means Clustering Algorithm

Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department

### Introduction to Artificial Intelligence

Introduction to Artificial Intelligence COMP307 Evolutionary Computing 3: Genetic Programming for Regression and Classification Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline Statistical parameter regression Symbolic

### Pedestrian Detection with Improved LBP and Hog Algorithm

Open Access Library Journal 2018, Volume 5, e4573 ISSN Online: 2333-9721 ISSN Print: 2333-9705 Pedestrian Detection with Improved LBP and Hog Algorithm Wei Zhou, Suyun Luo Automotive Engineering College,

### Support Vector Machines

Support Vector Machines SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions 6. Dealing

### Fast or furious? - User analysis of SF Express Inc

CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

### 2. On classification and related tasks

2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.

### CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows:

CS299 Detailed Plan Shawn Tice February 5, 2013 Overview The high-level steps for classifying web pages in Yioop are as follows: 1. Create a new classifier for a unique label. 2. Train it on a labelled

### Web Usage Mining: A Research Area in Web Mining

Web Usage Mining: A Research Area in Web Mining Rajni Pamnani, Pramila Chawan Department of computer technology, VJTI University, Mumbai Abstract Web usage mining is a main research area in Web mining