RBM-SMOTE: Restricted Boltzmann Machines for Synthetic Minority Oversampling Technique
Maciej Zięba, Jakub M. Tomczak, and Adam Gonczarek

Faculty of Computer Science and Management, Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, Wroclaw, Poland

Abstract. The problem of imbalanced data, i.e., when the class labels are unequally distributed, is encountered in many real-life applications, e.g., credit scoring and medical diagnostics. Various approaches aimed at dealing with imbalanced data have been proposed. One of the best-known data pre-processing methods is the Synthetic Minority Oversampling Technique (SMOTE). However, SMOTE may generate examples that are artificial in the sense that they are impossible to draw from the true distribution. Therefore, in this paper, we propose to apply a Restricted Boltzmann Machine to learn an intermediate representation that transforms the SMOTE examples into ones approximately drawn from the true distribution. At the end of the paper we report an experiment on a credit scoring dataset.

Keywords: imbalanced data, oversampling, SMOTE, RBM

1 Introduction

The problem of imbalanced data has become one of the key issues in training classification models [7]. A dataset is considered imbalanced if the class labels are strongly unequally distributed. Learning from imbalanced data may therefore have a negative impact on training: the resulting model is biased toward the majority class. Recently, numerous approaches have been proposed to deal with this issue. In general, they can be roughly divided into two groups: external and internal methods. External approaches aim at sampling examples in order to balance the training set. There are several oversampling methods such as the Synthetic Minority Oversampling Technique (SMOTE) [2] or its extensions, e.g., Borderline-SMOTE [11], LN-SMOTE [13], SMOTE-RSB [15], Safe-Level-SMOTE [1], or the recently introduced SMOTE-IPF [16].
Among undersampling techniques we can distinguish methods that use a K-NN classifier to identify relevant instances in the majority class [14], use evolutionary algorithms to balance the data [6], or exploit the mutual neighborhood relation called the Tomek link [21].
In internal approaches the balancing techniques are incorporated into the training process of a classifier. Typically, ensemble classifiers are adjusted to deal with imbalanced data either by making use of oversampling techniques to diversify the base learners, such as SMOTEBoost [3], SMOTEBagging [22], RAMOBoost [4], or by performing undersampling before creating each of the component classifiers, e.g., UnderBagging [20], Roughly Balanced Bagging [8], RUSBoost [18]. Besides ensemble-based approaches there are other internal balancing approaches, e.g., active learning strategies [5] and granular computing [19].

In this paper, we propose an extension of SMOTE in which the artificial examples generated by SMOTE are projected onto the manifold of an intermediate representation and then projected back to the input space. This extension results in new examples that are expected to be approximately drawn from the true distribution. In order to learn the manifold of the intermediate representation we propose to use Restricted Boltzmann Machines (RBM) [9]. RBMs are usually used for feature extraction, classification [12], or collaborative filtering [17], among other tasks. The idea of our approach is to first construct artificial examples using SMOTE, and then perform Gibbs sampling with an RBM trained on all minority examples to obtain a new sample. In other words, the SMOTE-based sample is a starting point for sampling from the RBM.

The paper is organized as follows. In Section 2 the RBM-SMOTE model for creating artificial samples is proposed. In Section 3 we present experimental results of the proposed approach on the Kaggle Give Me Some Credit¹ dataset. The work is summarized with conclusions and future work in Section 4.

2 Methodology

2.1 SMOTE

The SMOTE procedure is one of the most popular oversampling methods for coping with the imbalanced data phenomenon.
This approach generates artificial examples located on the path connecting a selected minority example and one of its closest neighbors. The number of examples to be sampled is set by the parameter $P_{SMOTE}$. In SMOTE we randomly pick examples from the minority class without replacement. The procedure for generating an artificial sample is described in Algorithm 1. Let us denote the dataset by $\mathcal{D}_N = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n$ is a vector of features describing the $n$-th example and $y_n \in \{-1, 1\}$ is the corresponding class label, with $1$ representing the positive (minority) class and $-1$ the negative (majority) class.² In the SMOTE algorithm the selected minority example $\mathbf{x}_i$ is taken as input, together with the entire training set $\mathcal{D}_N$, to produce a new artificial example $\tilde{\mathbf{x}}_i$. In the first step we randomly select one of the $K$ nearest neighbors of the example $\mathbf{x}_i$ (see Figure 1a).

¹ The Kaggle Give Me Some Credit dataset is available on-line.
² In the literature it is often assumed that the majority class is positive and the minority class is negative.
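The per-example procedure described above (pick one of the $K$ nearest neighbors, then interpolate at a random position $r$) can be sketched in Python. This is our illustration, not code from the paper; the function name and NumPy-based implementation are assumptions:

```python
import numpy as np

def smote_sample(X_min, i, K=5, rng=None):
    """Generate one SMOTE example from the minority example X_min[i]."""
    if rng is None:
        rng = np.random.default_rng()
    x_i = X_min[i]
    # Euclidean distances from x_i to all minority examples.
    d = np.linalg.norm(X_min - x_i, axis=1)
    d[i] = np.inf                       # exclude the example itself
    neighbors = np.argsort(d)[:K]       # indices of the K nearest neighbors
    j = rng.choice(neighbors)           # pick one neighbor uniformly
    r = rng.uniform(0.0, 1.0)           # random position on the segment
    return x_i + r * (X_min[j] - x_i)   # interpolate: x_i + r (x_j - x_i)

# Example: four 2-D minority points
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_new = smote_sample(X_min, i=0, K=3, rng=np.random.default_rng(0))
```

The new point always lies on the segment between the selected example and one of its neighbors, so it stays inside the convex hull of the minority class.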
Next, a random value $r$ is generated to set the location of the new example $\tilde{\mathbf{x}}_i$ on the path connecting the two points $\mathbf{x}_i$ and $\mathbf{x}_j$ (see Figure 1b). Finally, the position of the new artificial example $\tilde{\mathbf{x}}_i$ is calculated.

Algorithm 1: Creating an artificial sample with SMOTE
Input: $\mathcal{D}_N = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$: training set, $\mathbf{x}_i$: selected minority example, $K$: number of nearest neighbors
Output: $\tilde{\mathbf{x}}_i$: artificial example
1. Select $k$ uniformly from $\{1, \ldots, K\}$;
2. Find $\mathbf{x}_j$, the $k$-th nearest neighbor of $\mathbf{x}_i$ in $\mathcal{D}_N$;
3. Sample $r$ uniformly from $[0, 1]$;
4. $\tilde{\mathbf{x}}_i \leftarrow \mathbf{x}_i + r\,(\mathbf{x}_j - \mathbf{x}_i)$;

Fig. 1: Example of SMOTE for a single example: a) the selected example and its 3 nearest neighbors (black circles), b) the artificial example created with SMOTE.

The presented version of the SMOTE algorithm is designed to operate on real-valued features. However, it is easy to extend SMOTE to generate binary features. If $\mathbf{x}_i$ and $\mathbf{x}_j$ contain only binary values, the interpolated example $\tilde{\mathbf{x}}_i$ contains real values; however, $\tilde{\mathbf{x}}_i$ may then be used as the parameters of a multivariate Bernoulli distribution from which a binary vector is sampled. A detailed description of SMOTE with additional examples is presented in [2].

The main drawback of SMOTE sampling is that most of the created examples are impossible to observe in real data. The generated examples may be located far from the true distribution. As a consequence, learning a model on such artificial data leads to estimates biased by the noise incorporated in the newly created examples. For instance, consider the handwritten digits taken from the MNIST dataset³ presented in Figure 2. The SMOTE-based examples are in most cases far from digits that could be written by a human.

Fig. 2: (Two top rows) Pairs of examples taken from the MNIST dataset that are used to generate artificial samples with SMOTE. (Third row) The artificial examples sampled using SMOTE from the two real examples. (Bottom row) Examples sampled using SMOTE and then transformed using the RBM.

2.2 RBM

In this paper, we propose to apply a Restricted Boltzmann Machine (RBM) to adjust an artificial example sampled using SMOTE to one which is approximately drawn from the true distribution. We use the SMOTE examples as good starting points for Gibbs sampling from an RBM trained on the minority class cases. Considering the example in Figure 2, after applying Gibbs sampling to the artificial digits generated by SMOTE we obtain objects which are easier to interpret.

A Restricted Boltzmann Machine is a bipartite Markov Random Field in which visible and hidden units can be distinguished. In an RBM only connections between units in different layers are allowed, i.e., visible-to-hidden connections. The joint distribution of the binary visible and hidden units is the Gibbs distribution:

$$p(\mathbf{x}, \mathbf{h} \mid \theta) = \frac{1}{Z(\theta)} \exp\big(-E(\mathbf{x}, \mathbf{h} \mid \theta)\big), \quad (1)$$

with the following energy function:

$$E(\mathbf{x}, \mathbf{h} \mid \theta) = -\mathbf{x}^{\top}\mathbf{W}\mathbf{h} - \mathbf{b}^{\top}\mathbf{x} - \mathbf{c}^{\top}\mathbf{h}, \quad (2)$$

³ The MNIST dataset is available on-line.
where $\mathbf{x} \in \{0,1\}^D$ are the visible units, $\mathbf{h} \in \{0,1\}^M$ are the hidden units, $Z(\theta)$ is the normalizing constant dependent on $\theta$, and $\theta = \{\mathbf{W}, \mathbf{b}, \mathbf{c}\}$ is the set of parameters, where $\mathbf{W} \in \mathbb{R}^{D \times M}$, $\mathbf{b} \in \mathbb{R}^{D}$, and $\mathbf{c} \in \mathbb{R}^{M}$ are the weight matrix and the visible and hidden bias vectors, respectively. Since there are no connections among units within the same layer, i.e., neither visible-to-visible nor hidden-to-hidden connections, the visible units are conditionally independent given the hidden units and vice versa:

$$p(x_i = 1 \mid \mathbf{h}, \mathbf{W}, \mathbf{b}) = \mathrm{sigm}\big(\mathbf{W}_{i\cdot}\,\mathbf{h} + b_i\big), \quad (3)$$

$$p(h_j = 1 \mid \mathbf{x}, \mathbf{W}, \mathbf{c}) = \mathrm{sigm}\big((\mathbf{W}_{\cdot j})^{\top}\mathbf{x} + c_j\big), \quad (4)$$

where $\mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$ is the sigmoid function, $\mathbf{W}_{i\cdot}$ is the $i$-th row of the weight matrix, and $\mathbf{W}_{\cdot j}$ is the $j$-th column of the weight matrix. Therefore, the conditional probability distributions factorize as follows:

$$p(\mathbf{x} \mid \mathbf{h}, \mathbf{W}, \mathbf{b}) = \prod_{i=1}^{D} p(x_i \mid \mathbf{h}, \mathbf{W}, \mathbf{b}), \quad (5)$$

$$p(\mathbf{h} \mid \mathbf{x}, \mathbf{W}, \mathbf{c}) = \prod_{j=1}^{M} p(h_j \mid \mathbf{x}, \mathbf{W}, \mathbf{c}). \quad (6)$$

Unfortunately, gradient-based optimization methods cannot be directly applied to learn the parameters $\theta$, because exact gradient calculation is analytically intractable. However, we can adopt the Contrastive Divergence algorithm, which approximates the exact gradient using sampling methods [10]. To train the RBM we minimize the negative log-likelihood:

$$L(\theta) = -\sum_{n=1}^{N} \log p(\mathbf{x}_n \mid \theta). \quad (7)$$

Further, to prevent the model from overfitting, an additional regularization term can be added to the learning objective:

$$L_{\Omega}(\theta) = L(\theta) + \lambda\,\Omega(\theta), \quad (8)$$

where $\lambda > 0$ is the regularization coefficient and $\Omega(\theta)$ is the regularization term. In the following, we use weight decay regularization, i.e., $\Omega(\theta) = \|\mathbf{W}\|_F$, where $\|\cdot\|_F$ is the Frobenius norm.

2.3 RBM-SMOTE

Having introduced SMOTE and the RBM, we can now formulate our new oversampling scheme. The procedure for generating artificial examples is presented in Algorithm 2. First, the set of artificial examples $X_{SMOTE}$ is generated using the SMOTE procedure. Next, the RBM model is trained using only the minority
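The factorized conditionals (3)–(6) make block Gibbs sampling in an RBM cheap: sample all hidden units given the visibles, then all visibles given the hiddens. A minimal NumPy sketch, assuming randomly initialized parameters (the dimensions and names here are illustrative, not from the paper):

```python
import numpy as np

def sigm(a):
    """Sigmoid function, sigm(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, M = 6, 4                               # visible / hidden dimensionality
W = rng.normal(0.0, 0.1, size=(D, M))     # weight matrix
b = np.zeros(D)                           # visible biases
c = np.zeros(M)                           # hidden biases

def gibbs_step(x, rng):
    """One block-Gibbs step: sample h | x (eqs. 4, 6), then x | h (eqs. 3, 5)."""
    p_h = sigm(x @ W + c)                 # p(h_j = 1 | x) for all j
    h = (rng.uniform(size=M) < p_h).astype(float)
    p_x = sigm(W @ h + b)                 # p(x_i = 1 | h) for all i
    x_new = (rng.uniform(size=D) < p_x).astype(float)
    return x_new, h

x0 = (rng.uniform(size=D) < 0.5).astype(float)  # random binary visible vector
x1, h1 = gibbs_step(x0, rng)
```

In Contrastive Divergence training, one or a few such steps supply the "negative" samples used to approximate the intractable gradient of (7).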
examples $\mathcal{D}_N^+$ from the training data $\mathcal{D}_N$. Then, for each artificial example taken from $X_{SMOTE}$ we perform $K_G$ iterations of Gibbs sampling using the trained RBM. Each generated example $\tilde{\mathbf{x}}_n$ is included in the final set of examples $\tilde{X}_{SMOTE}$ that is returned by the procedure. One step of the procedure is also illustrated in Figure 3.

Algorithm 2: Creating artificial samples with SMOTE together with the RBM model
Input: $\mathcal{D}_N$: training set, $K_G$: number of Gibbs sampling iterations, $K$: number of nearest neighbours (SMOTE), $P_{SMOTE}$: percentage of artificial examples (SMOTE)
Output: $\tilde{X}_{SMOTE}$: set of generated artificial samples
1. Set $\tilde{X}_{SMOTE} = \emptyset$;
2. Generate the set of artificial samples $X_{SMOTE}$ by applying the SMOTE procedure to $\mathcal{D}_N$ with parameters $K$ and $P_{SMOTE}$;
3. Estimate $\theta = \{\mathbf{W}, \mathbf{b}, \mathbf{c}\}$ by training the RBM on the positive (minority) examples, i.e., $\mathcal{D}_N^+ = \{(\mathbf{x}_n, y_n) \in \mathcal{D}_N : y_n = 1\}$;
4. foreach $\mathbf{x}_n \in X_{SMOTE}$ do
5.   Set $\tilde{\mathbf{x}}_n = \mathbf{x}_n$;
6.   for $k = 1$ to $K_G$ do
7.     Sample $\mathbf{h}_n$ from $p(\mathbf{h} \mid \tilde{\mathbf{x}}_n, \mathbf{W}, \mathbf{c})$ (see eq. (6));
8.     Sample $\tilde{\mathbf{x}}_n$ from $p(\mathbf{x} \mid \mathbf{h}_n, \mathbf{W}, \mathbf{b})$ (see eq. (5));
9.   end
10.  Add $\tilde{\mathbf{x}}_n$ to $\tilde{X}_{SMOTE}$;
11. end

Fig. 3: Graphical interpretation of one step of Algorithm 2. Circles represent observations, the triangle denotes a new example generated by SMOTE, and the rectangle is the SMOTE example reconstructed using the RBM.
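End to end, Algorithm 2 chains the SMOTE interpolation, Bernoulli binarization, and $K_G$ Gibbs iterations. The sketch below is a simplified stand-in: it uses randomly initialized RBM parameters unless trained ones are supplied, whereas the paper first estimates them with Contrastive Divergence on the minority examples; the function name and hidden-layer size are our choices:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbm_smote(X_pos, n_new, K=3, K_G=1, M=8, rng=None, params=None):
    """Sketch of Algorithm 2: SMOTE on binary minority examples X_pos,
    then K_G block-Gibbs steps through an RBM.  `params` should hold
    CD-trained (W, b, c); if None, random parameters are used as a stand-in."""
    if rng is None:
        rng = np.random.default_rng()
    D = X_pos.shape[1]
    if params is None:
        W, b, c = rng.normal(0.0, 0.1, (D, M)), np.zeros(D), np.zeros(M)
    else:
        W, b, c = params
    out = []
    for _ in range(n_new):
        # -- SMOTE step: interpolate a minority example with a near neighbor --
        i = rng.integers(len(X_pos))
        d = np.linalg.norm(X_pos - X_pos[i], axis=1)
        d[i] = np.inf
        j = rng.choice(np.argsort(d)[:K])
        x = X_pos[i] + rng.uniform() * (X_pos[j] - X_pos[i])
        x = (rng.uniform(size=D) < x).astype(float)   # Bernoulli binarization
        # -- RBM step: K_G iterations of block Gibbs sampling --
        for _ in range(K_G):
            h = (rng.uniform(size=W.shape[1]) < sigm(x @ W + c)).astype(float)
            x = (rng.uniform(size=D) < sigm(W @ h + b)).astype(float)
        out.append(x)
    return np.array(out)

X_pos = (np.random.default_rng(1).uniform(size=(20, 10)) < 0.3).astype(float)
X_art = rbm_smote(X_pos, n_new=5, rng=np.random.default_rng(2))
```

With a CD-trained RBM, the Gibbs steps pull each SMOTE point toward high-probability regions of the learned minority distribution.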
In the presented procedure we use the artificial examples generated by SMOTE as good starting points for the Gibbs sampling procedure performed with the trained RBM. As a consequence, we obtain examples that are significantly closer to the true distribution than the SMOTE-based examples. Our empirical studies show that even one loop of the sampling procedure may produce good-quality artificial examples.

3 Experiment

Dataset. The proposed solution was tested on the Kaggle Give Me Some Credit data with the attribute vector transformed into binary inputs. Each instance is described by 59 binary features. The considered dataset is strongly affected by the imbalanced data phenomenon, with a high imbalance ratio.⁴

Methodology. The goal of the experiment was to compare the performance of plain SMOTE with the same sampling method combined with the RBM modification (further named RBM-SMOTE). As an evaluation criterion we chose the Gmean, which is defined as the square root of the product of the True Positive Rate (TPR, called Sensitivity)⁵:

$$TPR = \frac{TP}{TP + FN}, \quad (9)$$

and the True Negative Rate (TNR, called Specificity):

$$TNR = \frac{TN}{TN + FP}. \quad (10)$$

This criterion is widely used to evaluate the quality of classifiers trained on highly imbalanced data. We also analyzed the area under the ROC curve (AUC), here computed as the arithmetic mean of TPR and TNR. We compared the performance of SMOTE and RBM-SMOTE using classifiers that are typically applied in the domain of credit risk evaluation, i.e., two decision trees (J48 and CART), Logistic Regression (Log), and other typically used classifiers such as K-nearest neighbors (KNN), Naïve Bayes (NB), Bagging (Bag), AdaBoost (AdaB), Random Forest (RF), LogitBoost (LogitB), and Multilayer Perceptron (MLP). For each experiment we used 90% of the dataset for training and the remaining 10% for testing.⁶ For both methods the percentage of artificial examples was set to 1400%.
The RBM model was trained using the Contrastive Divergence procedure. Weight decay regularization was applied, with the regularization coefficient set based on the results of preliminary experiments.

⁴ The ratio between negatives and positives.
⁵ TP, TN, FP, FN are the elements of the confusion matrix.
⁶ Due to the large number of examples considered in the experiment it was unnecessary to apply other testing methodologies such as cross-validation.
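The evaluation criteria above follow directly from confusion-matrix counts: Gmean is the geometric mean of (9) and (10), and the AUC variant used here is their arithmetic mean. A small sketch with made-up counts (the function name and numbers are our illustration):

```python
import math

def scores(tp, fn, tn, fp):
    """Compute TPR, TNR, Gmean, and the TPR/TNR-mean AUC from counts."""
    tpr = tp / (tp + fn)           # sensitivity, eq. (9)
    tnr = tn / (tn + fp)           # specificity, eq. (10)
    gmean = math.sqrt(tpr * tnr)   # geometric mean of TPR and TNR
    auc = 0.5 * (tpr + tnr)        # arithmetic mean, as used in the paper
    return tpr, tnr, gmean, auc

# Hypothetical confusion matrix for an imbalanced test set
tpr, tnr, gmean, auc = scores(tp=60, fn=40, tn=900, fp=100)
# tpr = 0.6, tnr = 0.9
```

Note how Gmean punishes a degenerate majority-class classifier: with TPR near 0, Gmean collapses to near 0 even if accuracy is high.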
Table 1: The results of the experiment obtained on the Kaggle Give Me Some Credit data. For each classifier (Log, J48, CART, KNN, NB, Bag, AdaB, RF, LogitB, MLP) the table reports TPR, TNR, Gmean, and AUC under three settings: None, SMOTE, and RBM-SMOTE.

Results. The results are presented in Table 1. It can be noticed that when no oversampling method is applied the values of TPR are close to 0. Comparing our approach with SMOTE, we observe that RBM-SMOTE outperforms plain SMOTE for all classification methods (see Gmean and AUC in Table 1). The differences are especially visible for the comprehensible models (J48, CART, RF). It is important to highlight that our solution is noticeably better at detecting positive (minority) examples (see the TPR values in Table 1) for most of the classifiers considered in the experimental studies. This matters greatly in the considered credit scoring problem, where the minority class corresponds to the group of consumers who are unable to repay their financial liabilities.

4 Conclusion and future work

In this paper, we presented a novel oversampling technique that uses an RBM to adjust the examples created with SMOTE toward the true distribution over binary features. As a consequence, the artificial examples are expected to be approximately drawn from the true distribution. The results of the preliminary experiments performed on the selected dataset are promising and motivate a more thorough analysis of the proposed solution. For future work we plan to evaluate the quality of the proposed solution on a large number of datasets from various domains. We would also like to extend the presented approach to numerical features by assuming that the visible units are modeled with Gaussian distributions. Additionally, we plan to compare the results obtained by RBM-SMOTE with other SMOTE-based solutions (e.g., [16]).
Acknowledgments

The research conducted by the authors has been partially co-financed by the Ministry of Science and Higher Education, Republic of Poland, namely, Maciej Zięba: grant No. B40242/I32, Jakub M. Tomczak: grant No. B40020/I32, Adam Gonczarek: grant No. B40235/I32. The work conducted by Maciej Zięba is also co-financed by the European Union within the European Social Fund.

References

1. Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Advances in Knowledge Discovery and Data Mining. Springer (2009)
2. Chawla, N.V., Bowyer, K.W., Hall, L.O.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002)
3. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.: SMOTEBoost: Improving prediction of the minority class in boosting. In: Proceedings of the Principles of Knowledge Discovery in Databases, PKDD. Springer (2003)
4. Chen, S., He, H., Garcia, E.: RAMOBoost: Ranked minority oversampling in boosting. IEEE Transactions on Neural Networks 21(10) (2010)
5. Ertekin, S., Huang, J., Giles, C.: Active learning for class imbalance problem. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM (2007)
6. García, S., Fernández, A., Herrera, F.: Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Applied Soft Computing 9(4) (2009)
7. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9) (Sep 2009)
8. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2(5-6) (2009)
9. Hinton, G.: A practical guide to training restricted Boltzmann machines. Momentum 9(1) (2010)
10. Hinton, G.E.: A practical guide to training restricted Boltzmann machines. In: Neural Networks: Tricks of the Trade. Springer (2012)
11. Han, H., Wang, W., Mao, B.: Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing (2005)
12. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann machines. In: Proceedings of the 25th International Conference on Machine Learning. ACM (2008)
13. Maciejewski, T., Stefanowski, J.: Local neighbourhood extension of SMOTE for mining imbalanced data. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE (2011)
14. Mani, I., Zhang, I.: kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML Workshop on Learning from Imbalanced Data Sets (2003)
15. Ramentol, E., Caballero, Y., Bello, R., Herrera, F.: SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and Information Systems 33(2) (2012)
16. Sáez, J.A., Luengo, J., Stefanowski, J., Herrera, F.: SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences (2014)
17. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning. ACM (2007)
18. Seiffert, C., Khoshgoftaar, T., Van Hulse, J., Napolitano, A.: RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 40(1) (2010)
19. Tang, Y., Zhang, Y., Huang, Z.: Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 4(3) (2007)
20. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7) (2006)
21. Tomek, I.: Two modifications of CNN. IEEE Transactions on Systems, Man and Cybernetics 6(11) (Nov 1976)
22. Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. In: IEEE Symposium on Computational Intelligence and Data Mining. IEEE (2009)
More informationThe exam is closed book, closed notes except your one-page (two-sided) cheat sheet.
CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or
More informationA faster model selection criterion for OP-ELM and OP-KNN: Hannan-Quinn criterion
A faster model selection criterion for OP-ELM and OP-KNN: Hannan-Quinn criterion Yoan Miche 1,2 and Amaury Lendasse 1 1- Helsinki University of Technology - ICS Lab. Konemiehentie 2, 02015 TKK - Finland
More informationA Survey on Postive and Unlabelled Learning
A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled
More informationSupervised Learning Classification Algorithms Comparison
Supervised Learning Classification Algorithms Comparison Aditya Singh Rathore B.Tech, J.K. Lakshmipat University -------------------------------------------------------------***---------------------------------------------------------
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining
More informationFacial Expression Classification with Random Filters Feature Extraction
Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationFeature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process
Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process KITTISAK KERDPRASOP and NITTAYA KERDPRASOP Data Engineering Research Unit, School of Computer Engineering, Suranaree
More informationLink Prediction for Social Network
Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue
More informationRacing for unbalanced methods selection
The International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2013) Racing for unbalanced methods selection Andrea DAL POZZOLO, Olivier CAELEN, Serge WATERSCHOOT and Gianluca
More informationTraining Restricted Boltzmann Machines using Approximations to the Likelihood Gradient. Ali Mirzapour Paper Presentation - Deep Learning March 7 th
Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient Ali Mirzapour Paper Presentation - Deep Learning March 7 th 1 Outline of the Presentation Restricted Boltzmann Machine
More informationCHALLENGES IN HANDLING IMBALANCED BIG DATA: A SURVEY
CHALLENGES IN HANDLING IMBALANCED BIG DATA: A SURVEY B.S.Mounika Yadav 1, Sesha Bhargavi Velagaleti 2 1 Asst. Professor, IT Dept., Vasavi College of Engineering 2 Asst. Professor, IT Dept., G.Narayanamma
More informationA Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence
2nd International Conference on Electronics, Network and Computer Engineering (ICENCE 206) A Network Intrusion Detection System Architecture Based on Snort and Computational Intelligence Tao Liu, a, Da
More informationK-Neighbor Over-Sampling with Cleaning Data: A New Approach to Improve Classification. Performance in Data Sets with Class Imbalance
Applied Mathematical Sciences, Vol. 12, 2018, no. 10, 449-460 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ams.2018.8231 K-Neighbor ver-sampling with Cleaning Data: A New Approach to Improve Classification
More informationTo be Bernoulli or to be Gaussian, for a Restricted Boltzmann Machine
2014 22nd International Conference on Pattern Recognition To be Bernoulli or to be Gaussian, for a Restricted Boltzmann Machine Takayoshi Yamashita, Masayuki Tanaka, Eiji Yoshida, Yuji Yamauchi and Hironobu
More informationReddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011
Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions
More informationComparison of different preprocessing techniques and feature selection algorithms in cancer datasets
Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Konstantinos Sechidis School of Computer Science University of Manchester sechidik@cs.man.ac.uk Abstract
More informationPredict Employees Computer Access Needs in Company
Predict Employees Computer Access Needs in Company Xin Zhou & Wenqi Xiang Email: xzhou15,wenqi@stanford.edu 1.Department of Computer Science 2.Department of Electrical Engineering 1 Abstract When an employee
More informationMore Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA
More Learning Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA 1 Ensembles An ensemble is a set of classifiers whose combined results give the final decision. test feature vector
More informationSupport Vector Machine with Restarting Genetic Algorithm for Classifying Imbalanced Data
Support Vector Machine with Restarting Genetic Algorithm for Classifying Imbalanced Data Keerachart Suksut, Kittisak Kerdprasop, and Nittaya Kerdprasop Abstract Algorithms for data classification are normally
More informationAkarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction
Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different
More informationPathological Lymph Node Classification
Pathological Lymph Node Classification Jonathan Booher, Michael Mariscal and Ashwini Ramamoorthy SUNet ID: { jaustinb, mgm248, ashwinir } @stanford.edu Abstract Machine learning algorithms have the potential
More informationSTUDYING OF CLASSIFYING CHINESE SMS MESSAGES
STUDYING OF CLASSIFYING CHINESE SMS MESSAGES BASED ON BAYESIAN CLASSIFICATION 1 LI FENG, 2 LI JIGANG 1,2 Computer Science Department, DongHua University, Shanghai, China E-mail: 1 Lifeng@dhu.edu.cn, 2
More informationNPC: Neighbors Progressive Competition Algorithm for Classification of Imbalanced Data Sets
NPC: Neighbors Progressive Competition Algorithm for Classification of Imbalanced Data Sets Soroush Saryazdi 1, Bahareh Nikpour 2, Hossein Nezamabadi-pour 3 Department of Electrical Engineering, Shahid
More informationLearning Class-relevant Features and Class-irrelevant Features via a Hybrid third-order RBM
via a Hybrid third-order RBM Heng Luo Ruimin Shen Changyong Niu Carsten Ullrich Shanghai Jiao Tong University hengluo@sjtu.edu Shanghai Jiao Tong University rmshen@sjtu.edu Zhengzhou University iecyniu@zzu.edu.cn
More informationMachine Learning. Chao Lan
Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian
More informationThe exam is closed book, closed notes except your one-page cheat sheet.
CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right
More informationBoosting Support Vector Machines for Imbalanced Data Sets
Boosting Support Vector Machines for Imbalanced Data Sets Benjamin X. Wang and Nathalie Japkowicz School of information Technology and Engineering, University of Ottawa, 800 King Edward Ave., P.O.Box 450
More informationA Constrained Spreading Activation Approach to Collaborative Filtering
A Constrained Spreading Activation Approach to Collaborative Filtering Josephine Griffith 1, Colm O Riordan 1, and Humphrey Sorensen 2 1 Dept. of Information Technology, National University of Ireland,
More informationSemi-Supervised Clustering with Partial Background Information
Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
More informationWeka ( )
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised
More informationAnalysing the Multi-class Imbalanced Datasets using Boosting Methods and Relevant Information
I J C T A, 10(9), 2017, pp. 933-947 International Science Press ISSN: 0974-5572 Analysing the Multi-class Imbalanced Datasets using Boosting Methods and Relevant Information Neelam Rout*, Debahuti Mishra**
More informationA Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis
A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract
More informationRacing for Unbalanced Methods Selection
Racing for Unbalanced Methods Selection Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi Abstract State-of-the-art classification algorithms suffer when the data is skewed towards one class. This
More informationLabel Distribution Learning. Wei Han
Label Distribution Learning Wei Han, Big Data Research Center, UESTC Email:wei.hb.han@gmail.com Outline 1. Why label distribution learning? 2. What is label distribution learning? 2.1. Problem Formulation
More informationAn Analysis of the Rule Weights and Fuzzy Reasoning Methods for Linguistic Rule Based Classification Systems Applied to Problems with Highly Imbalanced Data Sets Alberto Fernández 1, Salvador García 1,
More informationReview on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationContent Based Image Retrieval system with a combination of Rough Set and Support Vector Machine
Shahabi Lotfabadi, M., Shiratuddin, M.F. and Wong, K.W. (2013) Content Based Image Retrieval system with a combination of rough set and support vector machine. In: 9th Annual International Joint Conferences
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More informationUsing Decision Boundary to Analyze Classifiers
Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision
More informationImproving Imputation Accuracy in Ordinal Data Using Classification
Improving Imputation Accuracy in Ordinal Data Using Classification Shafiq Alam 1, Gillian Dobbie, and XiaoBin Sun 1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand shafiq.alam@whitireia.ac.nz
More informationMachine Learning Lecture 3
Machine Learning Lecture 3 Probability Density Estimation II 19.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Exam dates We re in the process
More informationData Mining Classification: Alternative Techniques. Imbalanced Class Problem
Data Mining Classification: Alternative Techniques Imbalanced Class Problem Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Class Imbalance Problem Lots of classification problems
More informationOn dynamic ensemble selection and data preprocessing for multi-class imbalance learning
On dynamic ensemble selection and data preprocessing for multi-class imbalance learning Rafael M. O. Cruz Ecole de Technologie Supérieure Montreal, Canada rafaelmenelau@gmail.com Robert Sabourin Ecole
More information2. On classification and related tasks
2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.
More informationContents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation
Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4
More informationOn Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions
On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions CAMCOS Report Day December 9th, 2015 San Jose State University Project Theme: Classification The Kaggle Competition
More informationEnergy Based Models, Restricted Boltzmann Machines and Deep Networks. Jesse Eickholt
Energy Based Models, Restricted Boltzmann Machines and Deep Networks Jesse Eickholt ???? Who s heard of Energy Based Models (EBMs) Restricted Boltzmann Machines (RBMs) Deep Belief Networks Auto-encoders
More informationBioinformatics - Lecture 07
Bioinformatics - Lecture 07 Bioinformatics Clusters and networks Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Learning on profiles
More informationAvailable online at ScienceDirect. Procedia Computer Science 35 (2014 )
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 35 (2014 ) 388 396 18 th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems
More informationarxiv: v4 [cs.lg] 17 Sep 2018
Meta-Learning for Resampling Recommendation Systems * Dmitry Smolyakov 1, Alexander Korotin 1, Pavel Erofeev 2, Artem Papanov 2, Evgeny Burnaev 1 1 Skolkovo Institute of Science and Technology Nobel street,
More informationRandom Forest A. Fornaser
Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University
More informationVisual object classification by sparse convolutional neural networks
Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.
More information