A Solution to PAKDD 07 Data Mining Competition


Ye Wang, Bin Bi
Under the supervision of: Dehong Qiu

Abstract. This article presents a solution to the PAKDD 07 Data Mining Competition. We discuss the main challenge posed by the problem and our way of solving it.

1 Introduction

The PAKDD 07 Data Mining Competition task is a cross-selling problem, described as follows. A company has a customer base of credit card customers as well as a customer base of home loan (mortgage) customers. The company would like to cross-sell home loans to its credit card customers, and the main difficulty is to develop an effective scoring model that predicts the potential cross-selling take-ups.

A modeling dataset of 40,700 customers with 40 modeling variables (as of the point of application for the company's credit card), plus a target variable, is provided to the participants. This is a sample of customers who opened a new credit card with the company within a specific 2-year period and who did not have an existing home loan with the company. The categorical target variable Target_Flag has the value 1 if the customer then opened a home loan with the company within 12 months after opening the credit card (700 random samples), and the value 0 otherwise (40,000 random samples).

A prediction dataset (8,000 sampled cases) is also provided, with similar variables but with the target variable withheld. The data mining task is to produce a score for each customer in the prediction dataset indicating the credit card customer's propensity to take up a home loan with the company (the higher the score, the higher the propensity). The accuracy of the results is ranked in terms of AUC.

This paper gives a solution to the cross-selling problem. The rest of the report is organized as follows: Section 2 discusses the main challenge of the problem.
Section 3 proposes our method for the task. Section 4 presents some insights revealed by the obtained results.

2 Understanding the problem

We think the task is difficult for the following reasons:

- Class imbalance. In the training set, the number of customers with Target_Flag equal to 0 is more than fifty times the number of customers with Target_Flag equal to 1.
- Time-variant attributes. Among the 40 attributes there are several sequences of attributes that measure the same feature at different times (e.g. the four attributes B_ENQ_LAST_WEEK, B_ENQ_L1M, B_ENQ_L3M and B_ENQ_L6M capture a trend in the customer's actions). How to extract useful information from these sequences is a problem we must face.
- Unlabelled data. There is a large amount of unlabelled data, and getting some cues from it is also worth considering.

3 Solutions

3.1 Data preparation

In this step we went through a standard series of data preparation tasks:

- Partition the training data into an 80% learning set and a 20% testing set.
- Inspect the univariate distribution and frequency of each attribute.
- Preprocess the data by converting categorical values from literal strings to integer indices.
- For missing values in the dataset, replace them with a global value MISSING, or simply remove the fields containing them, since they give limited useful information. Other strategies, such as statistical regression, might be more effective, but due to the time constraint we did not experiment with them.
- We also note that some fields (e.g. DVR_LIC) show little influence on the result, so we believe they can be safely removed from the data.

3.2 Resampling technique

To address the class imbalance, we mainly tried two approaches: cost-sensitive learning and resampling. In the end we judged resampling to be better than cost-sensitive learning, because the training set is too skewed. Given the nature of this problem, we propose a technique that combines under-sampling and over-sampling. The method is as follows:

1. Denote by dp (700 instances in total) the positive subset (Target_Flag = 1) of the training dataset, and by dn (40,000 instances in total) the negative subset (Target_Flag = 0).
2. Take all 700 instances from dp and copy them seven times, obtaining 4,900 positive instances. Take 4,000 randomly selected instances from dn and mix them with the 4,900 positive instances, putting the 8,900 instances into a new group. Remove the 4,000 instances from dn.
3. If dn still has instances, go back to Step 2; otherwise the process finishes.

After this procedure we obtain 10 groups, each containing 8,900 instances. We deliberately make the number of positive instances slightly larger than the number of negative ones, because we want a classifier that is more inclined to classify instances as positive, which gives a higher AUC value. In practice this works as we expected, and it also performs better than using equal numbers of positive and negative instances.

3.3 Model selection

We mainly tried three kinds of classifiers. First we chose the C4.5 decision tree to model the problem, but we found that the tree model has an overfitting problem that is very difficult to tackle. Then we tried KNN, which was also not effective. Finally we used logistic regression, decision stump + AdaBoost, and VFI, all of which yield good results on the dataset.

To get a more stable result, we combine these three classifiers with a voting mechanism: we compute the average of the scores produced by the three classifiers. After all groups have been processed, we combine them by averaging the results obtained in each group. Figure 1 shows a ROC curve obtained when we were doing 10-fold validation.

Fig. 1. ROC curve (sensitivity vs. 1 - specificity) obtained with 10-fold validation.

Finally, we use rule selection to fine-tune the results. During data preparation we found that instances with certain attribute values are almost never labelled with TARGET_FLAG = 1. We extract these rules from the training dataset and reduce the predicted probabilities of the instances that satisfy them.

3.4 Brief overview of technical details
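As a concrete illustration of the resampling procedure of Section 3.2, the group construction can be sketched in Python. This is a minimal sketch, not the competition code; the function name `make_groups` and the toy tuples standing in for customer records are ours:

```python
import random

def make_groups(positives, negatives, copies=7, neg_per_group=4000, seed=0):
    """Build training groups as in Section 3.2: each group contains every
    positive instance replicated `copies` times plus a disjoint random
    slice of the negatives, so every negative is used exactly once."""
    rng = random.Random(seed)
    neg = list(negatives)
    rng.shuffle(neg)
    groups = []
    for start in range(0, len(neg), neg_per_group):
        chunk = neg[start:start + neg_per_group]
        groups.append(positives * copies + chunk)
    return groups

# Toy data with the competition's counts (700 positive, 40,000 negative):
pos = [("pos", i) for i in range(700)]
negs = [("neg", i) for i in range(40000)]
groups = make_groups(pos, negs)
print(len(groups), len(groups[0]))  # -> 10 8900
```

Each classifier is then trained once per group, and the per-group scores are averaged as described in Section 3.3.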

3.4.1 Decision stump algorithm

The process of the decision stump algorithm is given in Figure 2. A decision stump can be denoted by (Z, c), where Z is a peak, selected from the p = 124 peaks, and c is a threshold. The stump has two leaves: the left leaf contains the training samples whose intensity of peak Z is less than or equal to the threshold c, and the right leaf contains all other samples. If most of the samples in the left leaf are, say, customers who opened a home loan, then the samples with Z <= c will be classified as customers who opened a home loan.

Decision stump algorithm:

1. In the training set, count the number of examples in class C having value V for attribute A; store this information in a 3-d array COUNT[C, V, A].
2. The default class is the one having the most examples in the training set. The accuracy of the default class is the number of training examples in the default class divided by the total number of training examples.
3. For each numerical attribute A: create a nominal version of A by defining a finite number of intervals of values. These intervals become the "values" of the nominal version of A.
   Definitions: class C is optimal for attribute A and value V if it maximizes COUNT[C, V, A]; class C is optimal for attribute A and interval I if it maximizes COUNT[C, "interval I", A]. Values are partitioned into intervals so that every interval satisfies the following constraints:
   (a) There is at least one class that is "optimal" for more than SMALL of the values in the interval. This constraint does not apply to the rightmost interval.
   (b) If V[I] is the smallest value for attribute A in the training set that is larger than the values in interval I, then there is no class C that is optimal both for V[I] and for interval I.
4. For each attribute A (using the nominal version of numerical attributes):
   (a) Construct a hypothesis involving attribute A by selecting, for each value V of A (and also for "missing"), an optimal class for V. If several classes are optimal for a value, choose among them randomly.
   (b) Add the constructed hypothesis to a set called HYPOTHESES. This set will ultimately contain one hypothesis for each attribute.
5. 1R: choose the rule from the set HYPOTHESES having the highest accuracy on the training set (if there are several "best" rules, choose among them at random).
   1R*: choose all the rules from HYPOTHESES having an accuracy on the training set greater than the accuracy of the default class.

Fig. 2. Decision stump algorithm.

The classifier is denoted f(x), where x is the Boolean variable Z <= c, and f(x) takes values in {-1, 1}: f(x_i) = 1 if the i-th training sample is classified as class one, and f(x_i) = -1 if it is classified as class two. A sample is misclassified if y_i f(x_i) = -1. For the left leaf (Z <= c, i.e. x = true), let n_11 and n_21 be the numbers of observations with y_i = 1 and y_i = -1, respectively, i.e.,

  n_11 = Σ_i I{(y_i = 1) & (Z_i <= c)},  n_21 = Σ_i I{(y_i = -1) & (Z_i <= c)},   (1)

where I{statement} is the indicator function, which equals 1 if the statement is true and 0 otherwise. Similarly, let n_12 and n_22 be the numbers of observations with y_i = 1 and y_i = -1, respectively, and Z > c, i.e.,

  n_12 = Σ_i I{(y_i = 1) & (Z_i > c)},  n_22 = Σ_i I{(y_i = -1) & (Z_i > c)}.   (2)

The log likelihood of this multinomial model is

  log L = Σ_{u,v} n_uv log(p_uv),   (3)

where p_uv is estimated by n_uv / (n_1v + n_2v). The peak Z and its threshold c are obtained by maximizing the log likelihood.

VFI (Voting Feature Intervals) algorithm:

Train(TrainingSet):
  for each feature F:
    EndPoints[F] = find_end_points(TrainingSet, F, C)
    Sort(EndPoints[F])
    /* each pair of consecutive points in EndPoints[F] forms a feature interval */
    for each interval I on feature F:
      /* count the number of instances of class C falling into interval I */
      interval_class_count[F, I, C] = count_instances(F, I, C)

Classify(e):  /* e: example to be classified */
  for each class C: vote[C] = 0
  for each feature F:
    for each class C: feature_vote[F, C] = 0   /* vote of feature F for class C */
    if e's value for F, e_F, is known:
      I = find_interval(F, e_F)
      feature_vote[F, C] = interval_class_count[F, I, C] / class_count[C]
      normalize_feature_votes(F)
      vote[C] = vote[C] + feature_vote[F, C]
  return the class C with the highest vote[C]

Fig. 3. VFI (Voting Feature Intervals) algorithm.

3.4.2 VFI (Voting Feature Intervals) algorithm

The process of the VFI algorithm is given in Figure 3. It classifies by attribute discretization: the algorithm first builds feature intervals for each class and attribute, and then uses a voting strategy over these intervals to classify new instances. Entropy minimization is used to create suitable intervals.

4 Insights on the Obtained Results

The results we get from the trained models are very helpful in predicting potential buyers. Here we use the AdaBoost plus decision stump model to illustrate this. The model reveals the following:

- B_ENQ_L6M_GR3, CURR_RES_MTHS, AGE_AT_APPLICATION, B_ENQ_L6M_GR2, B_ENQ_L12M_GR2, ANNUAL_INCOME_RANGE and CURR_EMPL_MTHS are the most important attributes.
- B_ENQ_L6M_GR3 is the most effective attribute for predicting potential customers. When an instance's value for this attribute is greater than 0, the instance is likely to be a positive one. This means that a customer who had enquired about a mortgage at the bureau is a potential buyer.
- An instance with age greater than 40 is more likely to be a positive one.
- An instance with CURR_RES_MTHS smaller than 30 is more likely to be a positive one. This means that a person who has lived in his or her current house for no longer than two and a half years is more likely to be a mortgage buyer.
- An instance with B_ENQ_GR2 greater than 0, or with CURR_EMPL_MTHS smaller than 24 (two years), is more likely to be a positive one. This means that a person who enquires about loans, or who has not worked very long at the current job, is more likely to be a mortgage buyer.

Besides, the results also show that people with higher annual income are more inclined to be potential customers.
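The patterns above can be read as simple indicator rules. The sketch below is ours, not part of the competition solution: the field names follow those quoted in the text, the thresholds are the ones stated above, and the helper `positive_hints` is a hypothetical name.

```python
def positive_hints(row):
    """Count how many of the Section 4 indicators fire for one customer record.

    `row` maps attribute name -> numeric value; attributes missing from the
    record contribute no hint. Thresholds are the ones quoted in the text."""
    checks = [
        row.get("B_ENQ_L6M_GR3", 0) > 0,          # enquired about a mortgage at the bureau
        row.get("AGE_AT_APPLICATION", 0) > 40,     # older applicant
        0 < row.get("CURR_RES_MTHS", 0) < 30,      # short tenure at current residence
        row.get("B_ENQ_GR2", 0) > 0,               # enquired about loans
        0 < row.get("CURR_EMPL_MTHS", 0) < 24,     # short tenure in current job
    ]
    return sum(checks)

customer = {"B_ENQ_L6M_GR3": 2, "AGE_AT_APPLICATION": 45, "CURR_RES_MTHS": 12}
print(positive_hints(customer))  # -> 3
```

A count like this would only rank candidates coarsely; the paper instead uses such rules in the opposite direction, lowering the scores of instances matching rules that almost never co-occur with TARGET_FLAG = 1.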

References

1. Demiröz, G., Güvenir, H.A.: Classification by Voting Feature Intervals. Proc. of the Ninth European Conference on Machine Learning (ECML), Springer-Verlag, LNAI 1224 (1997) 85-92.
2. Drummond, C., Holte, R.C.: C4.5, Class Imbalance, and Cost Sensitivity: Why Under-sampling Beats Over-sampling. Proc. of the Workshop on Learning from Imbalanced Datasets II, Washington DC (2003).
3. Quinlan, J.R.: Induction of Decision Trees. Machine Learning 1(1) (1986) 81-106.
4. Bradley, A.P.: The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition 30(7) (1997) 1145-1159.
5. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers (1993).
6. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers (2000) 265-314.
7. Weiss, G.M.: Mining with Rarity: A Unifying Framework. SIGKDD Explorations 6(1) (2004) 7-19.
8. Joshi, M., Kumar, V., Agarwal, R.: Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements. First IEEE International Conference on Data Mining, San Jose, CA (2001).
9. Huang, K., Yang, H., King, I., et al.: Learning Classifiers from Imbalanced Data Based on Biased Minimax Probability Machine. Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2004).