A Solution to the PAKDD 07 Data Mining Competition

Ye Wang, Bin Bi
Under the supervision of: Dehong Qiu

Abstract. This article presents a solution to the PAKDD 07 Data Mining Competition. We discuss the main challenges of the problem and the way we solved them.

1 Introduction

The PAKDD 07 Data Mining Competition task is a cross-selling problem described as follows. A company has a customer base of credit card customers as well as a customer base of home loan (mortgage) customers. The company would like to cross-sell home loans to its credit card customers, and the main difficulty is to develop an effective scoring model that predicts the potential cross-selling take-ups.

A modeling dataset of 40,700 customers with 40 modeling variables (as of the point of application for the company's credit card), plus a target variable, is provided to the participants. This is a sample of customers who opened a new credit card with the company within a specific two-year period and who did not have an existing home loan with the company. The categorical target variable Target_Flag has the value 1 if the customer opened a home loan with the company within 12 months after opening the credit card (700 random samples), and the value 0 otherwise (40,000 random samples). A prediction dataset (8,000 sampled cases) with similar variables, but withholding the target variable, is also provided. The data mining task is to produce a score for each customer in the prediction dataset indicating the credit card customer's propensity to take up a home loan with the company (the higher the score, the higher the propensity). The accuracy of the results is ranked in terms of AUC, the area under the ROC curve.
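Since submissions are ranked purely by AUC, the metric can be computed directly from the 0/1 target flags and the submitted scores. The following is a minimal sketch, assuming scikit-learn is available; the variable names are illustrative and not part of the competition materials.

# Minimal AUC check, assuming scikit-learn; names are illustrative.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1, 0]              # ground-truth Target_Flag values
scores = [0.1, 0.4, 0.8, 0.2, 0.7, 0.3]  # predicted propensity scores
print("AUC = %.3f" % roc_auc_score(y_true, scores))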
This paper gives a solution to the cross-selling problem. The rest of the report is organized as follows: Section 2 discusses the main challenges of the problem, Section 3 proposes our method, and Section 4 presents some cues revealed by the obtained results.

2 Understanding the problem

We think the task is difficult for the following reasons:
- Class imbalance. In the training set, the number of people with Target_Flag equal to 0 is more than fifty times the number of people with Target_Flag equal to 1.
- Time-variant attributes. Among the 40 attributes there are several sequences of attributes that measure the same feature at different times (e.g. the four attributes B_ENQ_LAST_WEEK, B_ENQ_L1M, B_ENQ_L3M and B_ENQ_L6M reveal a trend in the customer's actions). How to extract useful information from these sequences is a problem we must face.
- There is a large amount of unlabelled data, and getting some cues from it is also worth considering.

3 Solutions

3.1 Data preparation

In this procedure we went through a standard series of data preparation steps:
- Partition the training data into an 80% learning set and a 20% testing set.
- Inspect the univariate distribution and frequency of each attribute.
- Preprocess the data by converting categorical values from literal strings to integer indices.
- Replace missing values with a global value MISSING, or simply remove the data fields containing them, since they give limited useful information. Other strategies such as statistical regression might be more effective, but due to the time constraint we did not have time to experiment with them.
- We also note that some data fields (e.g. DVR_LIC) show little influence on the result; we believe these fields can be safely removed from the data.

3.2 Resampling technique

To solve the imbalance problem we mainly tried two approaches: cost-sensitive learning and resampling. In the end we judged resampling better than cost-sensitive learning, as the training dataset is too skewed. Given the nature of this problem, we propose a technique that combines under-sampling and over-sampling. The method, sketched in code after this list, works as follows:
1. Denote by dp the positive subset of the training dataset (700 instances in total, Target_Flag = 1) and by dn the negative subset (40,000 instances in total, Target_Flag = 0).
2. Take all 700 instances from dp and copy them seven times, giving 4,900 positive instances. Randomly select 4,000 instances from dn and mix them with the 4,900 positive instances. Put the resulting 8,900 instances in a new group and remove the 4,000 selected instances from dn.
3. If dn still has instances, go back to Step 2; otherwise the process finishes.

After this procedure we obtain 10 groups, each containing 8,900 instances. We deliberately make the number of positive instances slightly larger than the number of negative ones because we want a classifier more inclined to classify instances as positive, which yields a higher AUC value. In practice this works as expected, and its effect is better than using the same number of positive and negative instances.
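The grouping procedure can be written compactly. Below is a minimal sketch, assuming the positive and negative instances are held in plain Python lists; the function and parameter names are our own illustration, not the original implementation.

# Sketch of the combined over-/under-sampling of Section 3.2.
import random

def build_groups(positives, negatives, copies=7, negs_per_group=4000, seed=0):
    """Split the negatives into chunks and pair each chunk with the
    replicated positives, as described in steps 1-3 above."""
    rng = random.Random(seed)
    pool = list(negatives)
    rng.shuffle(pool)                        # random selection = shuffle, then slice
    groups = []
    while pool:
        chunk, pool = pool[:negs_per_group], pool[negs_per_group:]
        group = positives * copies + chunk   # 700 x 7 = 4,900 positives per group
        rng.shuffle(group)
        groups.append(group)
    return groups                            # 40,000 / 4,000 gives 10 groups

With the competition's 700 positives and 40,000 negatives this yields exactly ten groups of 8,900 instances each.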
3.3 Model Selection

We mainly tried three kinds of classifiers. First we chose the C4.5 decision tree to model the problem, but we found that the tree model had an overfitting problem that was very difficult to tackle. We then tried KNN and found it ineffective as well. Finally we used logistic regression, decision stump + AdaBoost, and VFI, all of which yielded good results on the dataset.

To obtain a more stable result, we decided to combine these three classifiers with a voting mechanism: we compute the average of the results obtained from the three classifiers. After all groups are processed, we combine them by computing the average of the results obtained in each group. Figure 1 shows a ROC curve obtained during 10-fold cross-validation.

Fig. 1. ROC curve (sensitivity versus 1 - specificity) obtained during 10-fold cross-validation.

Finally, we use rule selection to fine-tune the results. During data preparation we found that instances with certain attribute values almost never have TARGET_FLAG equal to 1. We extract these rules from the training dataset and reduce the probabilities of the instances that satisfy them. The whole combination and fine-tuning step is sketched below.
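To make the combination concrete, the following is a minimal sketch of the scoring pipeline described above. It assumes each trained model exposes a predict_proba-style method returning the positive-class probability; models_per_group, rules and the damping factor are illustrative assumptions, since the paper does not state how strongly rule-matched scores are reduced.

# Sketch of the vote-and-average scoring of Section 3.3.
def score(instance, models_per_group, rules, damping=0.1):
    """models_per_group: one (logistic, stump+AdaBoost, VFI) triple per
    resampled group; rules: predicates for 'almost never positive'."""
    group_scores = []
    for models in models_per_group:
        probs = [m.predict_proba(instance) for m in models]
        group_scores.append(sum(probs) / len(probs))    # vote = average
    s = sum(group_scores) / len(group_scores)           # average over the 10 groups
    if any(rule(instance) for rule in rules):           # rule-based fine-tuning
        s *= damping                                    # reduce the propensity
    return s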
3.4 Brief overview of technical details

3.4.1 Decision stump algorithm

The process of the decision stump algorithm is given in Figure 2. A decision stump can be denoted by (Z, c), where Z is an attribute selected from the p candidate attributes and c is a threshold. The stump has two leaves: the left leaf contains the training samples whose value of attribute Z is less than or equal to the threshold c, and the right leaf contains all other samples. If most of the samples in the left leaf belong to one class, say customers who opened a home loan, then the samples with Z ≤ c will be classified as customers who opened a home loan.

Decision stump algorithm:
1. In the training set, count the number of examples in class C having value V for attribute A; store this information in a 3-dimensional array COUNT[C, V, A].
2. The default class is the one having the most examples in the training set. The accuracy of the default class is the number of training examples in the default class divided by the total number of training examples.
3. FOR EACH numerical attribute A: create a nominal version of A by defining a finite number of intervals of values; these intervals become the "values" of the nominal version of A.
   Definitions: class C is optimal for attribute A and value V if it maximizes COUNT[C, V, A]; class C is optimal for attribute A and interval I if it maximizes COUNT[C, "interval I", A]. Values are partitioned into intervals so that every interval satisfies the following constraints:
   (a) there is at least one class that is optimal for more than SMALL of the values in the interval (this constraint does not apply to the rightmost interval);
   (b) if V[I] is the smallest value for attribute A in the training set that is larger than the values in interval I, then there is no class C that is optimal both for V[I] and for interval I.
4. FOR EACH attribute A (using the nominal version of numerical attributes):
   (a) construct a hypothesis involving attribute A by selecting, for each value V of A (and also for "missing"), an optimal class for V; if several classes are optimal for a value, choose among them at random;
   (b) add the constructed hypothesis to a set called HYPOTHESES; this set will ultimately contain one hypothesis for each attribute.
5. 1R: choose the rule from the set HYPOTHESES having the highest accuracy on the training set (if there are several "best" rules, choose among them at random). 1R*: choose all the rules from HYPOTHESES having an accuracy on the training set greater than the accuracy of the default class.

Fig. 2. Decision stump algorithm
The classifier is denoted by f(x), where x is the Boolean variable Z ≤ c, and f(x) takes values in {-1, 1}: f(x_i) = 1 if the i-th training sample is classified as class one, and f(x_i) = -1 if the i-th sample is classified as class two. The sample is misclassified if y_i f(x_i) = -1. For the left leaf (Z ≤ c, i.e. x = true), let n_{11} and n_{21} be the numbers of observations with y_i = 1 and y_i = -1, respectively, i.e.

$$n_{11} = \sum_i I\{(y_i = 1) \wedge (Z_i \le c)\}, \quad n_{21} = \sum_i I\{(y_i = -1) \wedge (Z_i \le c)\}, \tag{1}$$

where I{statement} is the indicator function, which equals 1 if the statement is true and 0 otherwise, and Z_i denotes the value of attribute Z for the i-th sample. Similarly, let n_{12} and n_{22} be the numbers of observations with y_i = 1 and y_i = -1, respectively, and Z > c, i.e.

$$n_{12} = \sum_i I\{(y_i = 1) \wedge (Z_i > c)\}, \quad n_{22} = \sum_i I\{(y_i = -1) \wedge (Z_i > c)\}. \tag{2}$$

The log likelihood of this multinomial model is then

$$\log L = \sum_{u,v} n_{uv} \log(p_{uv}), \tag{3}$$

where p_{uv} is estimated by n_{uv} / (n_{1v} + n_{2v}). The attribute Z and its threshold c are obtained by maximizing the log likelihood.
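The following is a minimal Python sketch of this search: it enumerates every attribute and candidate threshold, builds the 2 x 2 contingency table n_uv of equations (1)-(2), and keeps the pair (Z, c) with the largest log likelihood (3). The data layout (X as rows of attribute values, y as labels in {-1, 1}) is an illustrative assumption.

# Sketch of stump selection by maximizing the log likelihood (3).
import math

def fit_stump(X, y):
    """Return (attribute index, threshold, log likelihood) of the best stump."""
    best = (None, None, -float("inf"))
    for z in range(len(X[0])):
        for c in sorted(set(row[z] for row in X)):   # candidate thresholds
            n = [[0, 0], [0, 0]]   # n[u][v]: u = class (+1/-1), v = leaf (<=c / >c)
            for row, label in zip(X, y):
                u = 0 if label == 1 else 1
                v = 0 if row[z] <= c else 1
                n[u][v] += 1
            log_l = 0.0
            for v in (0, 1):
                leaf_total = n[0][v] + n[1][v]       # n_1v + n_2v
                for u in (0, 1):
                    if n[u][v] > 0:                  # treat 0 * log(0) as 0
                        log_l += n[u][v] * math.log(n[u][v] / leaf_total)
            if log_l > best[2]:
                best = (z, c, log_l)
    return best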
VFI (Voting Feature Intervals) algorithm:

Train(TrainingSet):
  FOR EACH feature F:
    EndPoints[F] = EndPoints[F] ∪ find_end_points(TrainingSet, F, C)
    Sort(EndPoints[F])
    /* each pair of consecutive points in EndPoints[F] forms a feature interval */
    FOR EACH interval I on feature F:
      /* count the number of instances of class C falling into interval I */
      interval_class_count[F, I, C] = count_instances(F, I, C)

Classify(e):   /* e: example to be classified */
  vote[C] = 0
  FOR EACH feature F:
    feature_vote[F, C] = 0   /* vote of feature F for class C */
    IF the value e_F is known:
      I = find_interval(F, e_F)
      feature_vote[F, C] = interval_class_count[F, I, C] / class_count[C]
      normalize_feature_votes(F)
      vote[C] = vote[C] + feature_vote[F, C]
  RETURN the class C with the highest vote[C]

Fig. 3. VFI (Voting Feature Intervals) algorithm

3.4.2 VFI (Voting Feature Intervals) algorithm

The process of the VFI algorithm is given in Figure 3. It classifies by attribute discretization: the algorithm first builds feature intervals for each class and attribute, and then classifies new examples with a voting strategy. Entropy minimization is used to create suitable intervals.
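A simplified Python sketch of VFI follows Figure 3. It forms the intervals from each class's minimum and maximum feature values and, unlike the full algorithm, does not treat point intervals separately; the data layout (X as rows of feature values, y as class labels) is an illustrative assumption.

# Simplified VFI sketch: interval construction, counting, and voting.
from bisect import bisect_right
from collections import Counter

def train_vfi(X, y):
    classes = sorted(set(y))
    class_count = Counter(y)
    model = []
    for f in range(len(X[0])):
        col = [row[f] for row in X]
        # end points: each class's lowest and highest value on feature f
        pts = sorted({func(v for v, c in zip(col, y) if c == k)
                      for k in classes for func in (min, max)})
        counts = Counter()                       # (interval index, class) -> count
        for v, c in zip(col, y):
            counts[(bisect_right(pts, v), c)] += 1
        model.append((pts, counts))
    return classes, class_count, model

def classify_vfi(e, classes, class_count, model):
    votes = Counter()
    for f, (pts, counts) in enumerate(model):
        i = bisect_right(pts, e[f])              # interval of the feature value
        fv = {k: counts[(i, k)] / class_count[k] for k in classes}
        total = sum(fv.values()) or 1.0          # normalize this feature's votes
        for k in classes:
            votes[k] += fv[k] / total
    return max(classes, key=lambda k: votes[k])

For a propensity score rather than a hard label, votes[1] / sum(votes.values()) can be returned instead.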
4 Insights on the Obtained Results

The results we obtain from the trained models are very helpful in predicting potential buyers. Here we use AdaBoost plus decision stumps as an illustration. The resulting model shows the following:
- B_ENQ_L6M_GR3, CURR_RES_MTHS, AGE_AT_APPLICATION, B_ENQ_L6M_GR2, B_ENQ_L12M_GR2, ANNUAL_INCOME_RANGE and CURR_EMPL_MTHS are the most important attributes.
- B_ENQ_L6M_GR3 is the most effective attribute for predicting potential customers. When an instance's value for this attribute is greater than 0, the instance is likely to be positive. This means that a customer who has enquired about a mortgage at the bureau is a potential buyer.
- An instance with age greater than 40 is more likely to be positive.
- An instance with CURR_RES_MTHS smaller than 30 is more likely to be positive. This means that a person who has lived in his or her current house for no longer than two and a half years is more likely to be a mortgage buyer.
- An instance with B_ENQ_GR2 greater than 0, or with CURR_EMPL_MTHS smaller than 24 (two years), is more likely to be positive, which means that a person who enquires about loans or has not worked very long at the current job is more likely to be a mortgage buyer.

Besides this, the results also show that people with higher annual income are more inclined to be potential customers.

References

1. Demiröz, G., Güvenir, H.A.: Classification by Voting Feature Intervals. In: Proc. of the Ninth European Conference on Machine Learning (ECML), LNAI 1224, Springer-Verlag (1997) 85-92
2. Drummond, C., Holte, R.C.: C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling. In: Proc. of the Workshop on Learning from Imbalanced Datasets II, Washington, DC (2003)
3. Quinlan, J.R.: Induction of Decision Trees. Machine Learning 1(1) (1986) 81-106
4. Bradley, A.P.: The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition 30(7) (1997) 1145-1159
5. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers (1993)
6. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers (2000) 265-314
7. Weiss, G.M.: Mining with Rarity: A Unifying Framework. SIGKDD Explorations 6(1) (2004) 7-19
8. Joshi, M., Kumar, V., Agarwal, R.: Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements. In: Proc. of the First IEEE International Conference on Data Mining, San Jose, CA (2001)
9. Huang, K., Yang, H., King, I., et al.: Learning Classifiers from Imbalanced Data Based on Biased Minimax Probability Machine. In: Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2004)