RBM-SMOTE: Restricted Boltzmann Machines for Synthetic Minority Oversampling Technique
Maciej Zięba, Jakub M. Tomczak, and Adam Gonczarek

Faculty of Computer Science and Management, Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, Wroclaw, Poland

Abstract. The problem of imbalanced data, i.e., when the class labels are unequally distributed, is encountered in many real-life applications, e.g., credit scoring and medical diagnostics. Various approaches aimed at dealing with imbalanced data have been proposed. One of the best-known data pre-processing methods is the Synthetic Minority Oversampling Technique (SMOTE). However, SMOTE may generate examples that are artificial in the sense that they are impossible to draw from the true distribution. Therefore, in this paper, we propose to apply a Restricted Boltzmann Machine to learn an intermediate representation that transforms the SMOTE examples into ones approximately drawn from the true distribution. At the end of the paper we report an experiment on a credit scoring dataset.

Keywords: imbalanced data, oversampling, SMOTE, RBM

1 Introduction

The problem of imbalanced data has become one of the key issues in training classification models [7]. A dataset is considered imbalanced if the class labels are strongly unequally distributed. Learning from imbalanced data may therefore have a negative impact on training: the resulting model is biased toward the majority class. Recently, numerous approaches have been proposed to deal with this issue. In general, they can be roughly divided into two groups: external and internal methods. External approaches aim at sampling examples in order to balance the training set. There are several oversampling methods such as the Synthetic Minority Oversampling Technique (SMOTE) [2] or its extensions, e.g., Borderline-SMOTE [11], LN-SMOTE [13], SMOTE-RSB [15], Safe-Level-SMOTE [1], or the recently introduced SMOTE-IPF [16].
Among undersampling techniques we can distinguish methods that use a K-NN classifier to identify relevant instances in the majority class [14], use evolutionary algorithms to balance the data [6], or exploit the mutual neighborhood relation called the Tomek link [21].
In internal approaches the balancing techniques are incorporated into the training process of a classifier. Typically, ensemble classifiers are adjusted to deal with imbalanced data either by making use of oversampling techniques to diversify the base learners, such as SMOTEBoost [3], SMOTEBagging [22], RAMOBoost [4], or by performing undersampling before creating each of the component classifiers, e.g., UnderBagging [20], Roughly Balanced Bagging [8], RUSBoost [18]. Besides ensemble-based approaches there are other internal balancing approaches, e.g., active learning strategies [5] and granular computing [19].

In this paper, we propose an extension of SMOTE in which the artificial examples generated by SMOTE are projected onto the manifold of an intermediate representation and then projected back to the input space. This extension results in new examples that are expected to be approximately drawn from the true distribution. In order to learn the manifold of the intermediate representation we propose to use Restricted Boltzmann Machines (RBM) [9]. RBMs are usually used for feature extraction, classification [12], or collaborative filtering [17], among other tasks. The idea of our approach is to first construct artificial examples using SMOTE, and then perform Gibbs sampling with an RBM trained on all minority examples to obtain a new sample. In other words, the SMOTE-based sample is a starting point for sampling from the RBM.

The paper is organized as follows. In Section 2 the RBM-SMOTE model for creating artificial samples is proposed. In Section 3 we present experimental results of the proposed approach on the Kaggle Give Me Some Credit¹ dataset. The work is summarized with conclusions and future work in Section 4.

2 Methodology

2.1 SMOTE

The SMOTE procedure is one of the most popular oversampling methods for coping with the imbalanced data phenomenon.
This approach generates artificial examples located on the path connecting a selected minority example and one of its closest neighbors. The number of examples to be sampled is set by the parameter $P_{SMOTE}$. In SMOTE we randomly pick examples from the minority class without replacement. The procedure for generating an artificial sample is described in Algorithm 1. Let us denote the dataset by $\mathcal{D}_N = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n$ is a vector of features describing the $n$-th example and $y_n \in \{-1, 1\}$ is the corresponding class label, with $1$ representing the positive (minority) class and $-1$ the negative (majority) class.² In the SMOTE algorithm the selected minority example $\mathbf{x}_i$ is taken as input, together with the entire training set $\mathcal{D}_N$, to produce a new artificial example $\tilde{\mathbf{x}}_i$. In the first step we randomly select one of the $K$ nearest neighbors of the example $\mathbf{x}_i$ (see Figure 1a).

¹ The Kaggle Give Me Some Credit dataset is available on-line.
² In the literature it is often assumed that the majority class is positive and the minority class is negative.
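The per-example procedure described above (pick one of the $K$ nearest neighbors, then interpolate at a random position $r$) can be sketched in Python. This is our illustration, not code from the paper; the function name and NumPy-based implementation are assumptions:

```python
import numpy as np

def smote_sample(X_min, i, K=5, rng=None):
    """Generate one SMOTE example from the minority example X_min[i]."""
    if rng is None:
        rng = np.random.default_rng()
    x_i = X_min[i]
    # Euclidean distances from x_i to all minority examples.
    d = np.linalg.norm(X_min - x_i, axis=1)
    d[i] = np.inf                       # exclude the example itself
    neighbors = np.argsort(d)[:K]       # indices of the K nearest neighbors
    j = rng.choice(neighbors)           # pick one neighbor uniformly
    r = rng.uniform(0.0, 1.0)           # random position on the segment
    return x_i + r * (X_min[j] - x_i)   # interpolate: x_i + r (x_j - x_i)

# Example: four 2-D minority points
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_new = smote_sample(X_min, i=0, K=3, rng=np.random.default_rng(0))
```

The new point always lies on the segment between the selected example and one of its neighbors, so it stays inside the convex hull of the minority class.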
Next, a random value $r$ is generated to set the location of the new example $\tilde{\mathbf{x}}_i$ on the path connecting the two points $\mathbf{x}_i$ and $\mathbf{x}_j$ (see Figure 1b). Finally, the position of the new artificial example $\tilde{\mathbf{x}}_i$ is calculated.

Algorithm 1: Creating an artificial sample with SMOTE
Input: $\mathcal{D}_N = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$: training set, $\mathbf{x}_i$: selected minority example, $K$: number of nearest neighbors
Output: $\tilde{\mathbf{x}}_i$: artificial example
1. Select $k$ uniformly from $\{1, \ldots, K\}$;
2. Find $\mathbf{x}_j$, the $k$-th nearest neighbor of $\mathbf{x}_i$ in $\mathcal{D}_N$;
3. Sample $r$ uniformly from $[0, 1]$;
4. $\tilde{\mathbf{x}}_i \leftarrow \mathbf{x}_i + r\,(\mathbf{x}_j - \mathbf{x}_i)$;

Fig. 1: Example of SMOTE for a single example: a) the selected example and its 3 nearest neighbors (black circles), b) the artificial example created with SMOTE.

The presented version of the SMOTE algorithm is designed to operate on real-valued features. However, it is easy to extend SMOTE to generate binary features. If $\mathbf{x}_i$ and $\mathbf{x}_j$ contain only binary values, the interpolated example $\tilde{\mathbf{x}}_i$ contains real values; however, $\tilde{\mathbf{x}}_i$ may then be used as the parameters of a multivariate Bernoulli distribution from which a binary vector is sampled. A detailed description of SMOTE with additional examples is presented in [2].

The main drawback of SMOTE sampling is that most of the created examples are impossible to observe in real data. The generated examples may be located far from the true distribution. As a consequence, learning a model on such artificial data leads to estimates biased by the noise incorporated in the newly created examples. For instance, consider the handwritten digits taken from the MNIST dataset³ presented in Figure 2. The SMOTE-based examples are in most cases far from digits that could be written by a human.

Fig. 2: (Two top rows) Pairs of examples taken from the MNIST dataset that are used to generate artificial samples with SMOTE. (Third row) The artificial examples sampled using SMOTE from the two real examples. (Bottom row) Examples sampled using SMOTE and then transformed using the RBM.

2.2 RBM

In this paper, we propose to apply a Restricted Boltzmann Machine (RBM) to adjust an artificial example sampled using SMOTE to one which is approximately drawn from the true distribution. We use the SMOTE examples as good starting points for Gibbs sampling from an RBM trained on the minority class cases. Considering the example in Figure 2, after applying Gibbs sampling to the artificial digits generated by SMOTE we obtain objects which are easier to interpret.

A Restricted Boltzmann Machine is a bipartite Markov Random Field in which visible and hidden units can be distinguished. In an RBM only connections between units in different layers are allowed, i.e., visible-to-hidden connections. The joint distribution of the binary visible and hidden units is the Gibbs distribution:

$$p(\mathbf{x}, \mathbf{h} \mid \theta) = \frac{1}{Z(\theta)} \exp\big(-E(\mathbf{x}, \mathbf{h} \mid \theta)\big), \quad (1)$$

with the following energy function:

$$E(\mathbf{x}, \mathbf{h} \mid \theta) = -\mathbf{x}^{\top}\mathbf{W}\mathbf{h} - \mathbf{b}^{\top}\mathbf{x} - \mathbf{c}^{\top}\mathbf{h}, \quad (2)$$

³ The MNIST dataset is available on-line.
where $\mathbf{x} \in \{0,1\}^D$ are the visible units, $\mathbf{h} \in \{0,1\}^M$ are the hidden units, $Z(\theta)$ is the normalizing constant dependent on $\theta$, and $\theta = \{\mathbf{W}, \mathbf{b}, \mathbf{c}\}$ is the set of parameters, where $\mathbf{W} \in \mathbb{R}^{D \times M}$, $\mathbf{b} \in \mathbb{R}^{D}$, and $\mathbf{c} \in \mathbb{R}^{M}$ are the weight matrix and the visible and hidden bias vectors, respectively. Since there are no connections among units within the same layer, i.e., neither visible-to-visible nor hidden-to-hidden connections, the visible units are conditionally independent given the hidden units and vice versa:

$$p(x_i = 1 \mid \mathbf{h}, \mathbf{W}, \mathbf{b}) = \mathrm{sigm}\big(\mathbf{W}_{i\cdot}\,\mathbf{h} + b_i\big), \quad (3)$$

$$p(h_j = 1 \mid \mathbf{x}, \mathbf{W}, \mathbf{c}) = \mathrm{sigm}\big((\mathbf{W}_{\cdot j})^{\top}\mathbf{x} + c_j\big), \quad (4)$$

where $\mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$ is the sigmoid function, $\mathbf{W}_{i\cdot}$ is the $i$-th row of the weight matrix, and $\mathbf{W}_{\cdot j}$ is the $j$-th column of the weight matrix. Therefore, the conditional probability distributions factorize as follows:

$$p(\mathbf{x} \mid \mathbf{h}, \mathbf{W}, \mathbf{b}) = \prod_{i=1}^{D} p(x_i \mid \mathbf{h}, \mathbf{W}, \mathbf{b}), \quad (5)$$

$$p(\mathbf{h} \mid \mathbf{x}, \mathbf{W}, \mathbf{c}) = \prod_{j=1}^{M} p(h_j \mid \mathbf{x}, \mathbf{W}, \mathbf{c}). \quad (6)$$

Unfortunately, gradient-based optimization methods cannot be directly applied to learn the parameters $\theta$, because exact gradient calculation is analytically intractable. However, we can adopt the Contrastive Divergence algorithm, which approximates the exact gradient using sampling methods [10]. To train the RBM we minimize the negative log-likelihood:

$$L(\theta) = -\sum_{n=1}^{N} \log p(\mathbf{x}_n \mid \theta). \quad (7)$$

Further, to prevent the model from overfitting, an additional regularization term can be added to the learning objective:

$$L_{\Omega}(\theta) = L(\theta) + \lambda\,\Omega(\theta), \quad (8)$$

where $\lambda > 0$ is the regularization coefficient and $\Omega(\theta)$ is the regularization term. In the following, we use weight decay regularization, i.e., $\Omega(\theta) = \|\mathbf{W}\|_F$, where $\|\cdot\|_F$ is the Frobenius norm.

2.3 RBM-SMOTE

Having introduced SMOTE and the RBM, we can now formulate our new oversampling scheme. The procedure for generating artificial examples is presented in Algorithm 2. First, the set of artificial examples $X_{SMOTE}$ is generated using the SMOTE procedure. Next, the RBM model is trained using only the minority
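The factorized conditionals (3)–(6) make block Gibbs sampling in an RBM cheap: sample all hidden units given the visibles, then all visibles given the hiddens. A minimal NumPy sketch, assuming randomly initialized parameters (the dimensions and names here are illustrative, not from the paper):

```python
import numpy as np

def sigm(a):
    """Sigmoid function, sigm(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, M = 6, 4                               # visible / hidden dimensionality
W = rng.normal(0.0, 0.1, size=(D, M))     # weight matrix
b = np.zeros(D)                           # visible biases
c = np.zeros(M)                           # hidden biases

def gibbs_step(x, rng):
    """One block-Gibbs step: sample h | x (eqs. 4, 6), then x | h (eqs. 3, 5)."""
    p_h = sigm(x @ W + c)                 # p(h_j = 1 | x) for all j
    h = (rng.uniform(size=M) < p_h).astype(float)
    p_x = sigm(W @ h + b)                 # p(x_i = 1 | h) for all i
    x_new = (rng.uniform(size=D) < p_x).astype(float)
    return x_new, h

x0 = (rng.uniform(size=D) < 0.5).astype(float)  # random binary visible vector
x1, h1 = gibbs_step(x0, rng)
```

In Contrastive Divergence training, one or a few such steps supply the "negative" samples used to approximate the intractable gradient of (7).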
examples $\mathcal{D}_N^+$ from the training data $\mathcal{D}_N$. Then, for each artificial example taken from $X_{SMOTE}$ we perform $K_G$ iterations of Gibbs sampling using the trained RBM. Each generated example $\tilde{\mathbf{x}}_n$ is included in the final set of examples $\tilde{X}_{SMOTE}$ that is returned by the procedure. One step of the procedure is also illustrated in Figure 3.

Algorithm 2: Creating artificial samples with SMOTE together with the RBM model
Input: $\mathcal{D}_N$: training set, $K_G$: number of Gibbs sampling iterations, $K$: number of nearest neighbours (SMOTE), $P_{SMOTE}$: percentage of artificial examples (SMOTE)
Output: $\tilde{X}_{SMOTE}$: set of generated artificial samples
1. Set $\tilde{X}_{SMOTE} = \emptyset$;
2. Generate the set of artificial samples $X_{SMOTE}$ by applying the SMOTE procedure to $\mathcal{D}_N$ with parameters $K$ and $P_{SMOTE}$;
3. Estimate $\theta = \{\mathbf{W}, \mathbf{b}, \mathbf{c}\}$ by training the RBM on the positive (minority) examples, i.e., $\mathcal{D}_N^+ = \{(\mathbf{x}_n, y_n) \in \mathcal{D}_N : y_n = 1\}$;
4. foreach $\mathbf{x}_n \in X_{SMOTE}$ do
5.   Set $\tilde{\mathbf{x}}_n = \mathbf{x}_n$;
6.   for $k = 1$ to $K_G$ do
7.     Sample $\mathbf{h}_n$ from $p(\mathbf{h} \mid \tilde{\mathbf{x}}_n, \mathbf{W}, \mathbf{c})$ (see eq. (6));
8.     Sample $\tilde{\mathbf{x}}_n$ from $p(\mathbf{x} \mid \mathbf{h}_n, \mathbf{W}, \mathbf{b})$ (see eq. (5));
9.   end
10.  Add $\tilde{\mathbf{x}}_n$ to $\tilde{X}_{SMOTE}$;
11. end

Fig. 3: Graphical interpretation of one step of Algorithm 2. Circles represent observations, the triangle denotes a new example generated by SMOTE, and the rectangle is the SMOTE example reconstructed using the RBM.
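End to end, Algorithm 2 chains the SMOTE interpolation, Bernoulli binarization, and $K_G$ Gibbs iterations. The sketch below is a simplified stand-in: it uses randomly initialized RBM parameters unless trained ones are supplied, whereas the paper first estimates them with Contrastive Divergence on the minority examples; the function name and hidden-layer size are our choices:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbm_smote(X_pos, n_new, K=3, K_G=1, M=8, rng=None, params=None):
    """Sketch of Algorithm 2: SMOTE on binary minority examples X_pos,
    then K_G block-Gibbs steps through an RBM.  `params` should hold
    CD-trained (W, b, c); if None, random parameters are used as a stand-in."""
    if rng is None:
        rng = np.random.default_rng()
    D = X_pos.shape[1]
    if params is None:
        W, b, c = rng.normal(0.0, 0.1, (D, M)), np.zeros(D), np.zeros(M)
    else:
        W, b, c = params
    out = []
    for _ in range(n_new):
        # -- SMOTE step: interpolate a minority example with a near neighbor --
        i = rng.integers(len(X_pos))
        d = np.linalg.norm(X_pos - X_pos[i], axis=1)
        d[i] = np.inf
        j = rng.choice(np.argsort(d)[:K])
        x = X_pos[i] + rng.uniform() * (X_pos[j] - X_pos[i])
        x = (rng.uniform(size=D) < x).astype(float)   # Bernoulli binarization
        # -- RBM step: K_G iterations of block Gibbs sampling --
        for _ in range(K_G):
            h = (rng.uniform(size=W.shape[1]) < sigm(x @ W + c)).astype(float)
            x = (rng.uniform(size=D) < sigm(W @ h + b)).astype(float)
        out.append(x)
    return np.array(out)

X_pos = (np.random.default_rng(1).uniform(size=(20, 10)) < 0.3).astype(float)
X_art = rbm_smote(X_pos, n_new=5, rng=np.random.default_rng(2))
```

With a CD-trained RBM, the Gibbs steps pull each SMOTE point toward high-probability regions of the learned minority distribution.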
In the presented procedure we use the artificial examples generated by SMOTE as good starting points for the Gibbs sampling procedure performed with the trained RBM. As a consequence, we obtain examples that are significantly closer to the true distribution than the SMOTE-based examples. Our empirical studies show that even one loop of the sampling procedure may produce good-quality artificial examples.

3 Experiment

Dataset. The proposed solution was tested on the Kaggle Give Me Some Credit data with the attribute vector transformed into binary inputs. Each instance is described by 59 binary features. The considered dataset is strongly affected by the imbalanced data phenomenon, with a high imbalance ratio.⁴

Methodology. The goal of the experiment was to compare the performance of plain SMOTE with the same sampling method combined with the RBM modification (further named RBM-SMOTE). As an evaluation criterion we chose the Gmean, which is defined as the square root of the product of the True Positive Rate (TPR, called Sensitivity)⁵:

$$TPR = \frac{TP}{TP + FN}, \quad (9)$$

and the True Negative Rate (TNR, called Specificity):

$$TNR = \frac{TN}{TN + FP}. \quad (10)$$

This criterion is widely used to evaluate the quality of classifiers trained on highly imbalanced data. We also analyzed the area under the ROC curve (AUC), here computed as the arithmetic mean of TPR and TNR. We compared the performance of SMOTE and RBM-SMOTE using classifiers that are typically applied in the domain of credit risk evaluation, i.e., two decision trees (J48 and CART), Logistic Regression (Log), and other typically used classifiers such as K-nearest neighbors (KNN), Naïve Bayes (NB), Bagging (Bag), AdaBoost (AdaB), Random Forest (RF), LogitBoost (LogitB), and Multilayer Perceptron (MLP). For each experiment we used 90% of the dataset for training and the remaining 10% for testing.⁶ For both methods the percentage of artificial examples was set to 1400%.
The RBM model was trained using the Contrastive Divergence procedure. Weight decay regularization was applied, with the regularization coefficient set based on the results of preliminary experiments.

⁴ The ratio between negatives and positives.
⁵ TP, TN, FP, FN are the elements of the confusion matrix.
⁶ Due to the large number of examples considered in the experiment it was unnecessary to apply other testing methodologies such as cross-validation.
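The evaluation criteria above follow directly from confusion-matrix counts: Gmean is the geometric mean of (9) and (10), and the AUC variant used here is their arithmetic mean. A small sketch with made-up counts (the function name and numbers are our illustration):

```python
import math

def scores(tp, fn, tn, fp):
    """Compute TPR, TNR, Gmean, and the TPR/TNR-mean AUC from counts."""
    tpr = tp / (tp + fn)           # sensitivity, eq. (9)
    tnr = tn / (tn + fp)           # specificity, eq. (10)
    gmean = math.sqrt(tpr * tnr)   # geometric mean of TPR and TNR
    auc = 0.5 * (tpr + tnr)        # arithmetic mean, as used in the paper
    return tpr, tnr, gmean, auc

# Hypothetical confusion matrix for an imbalanced test set
tpr, tnr, gmean, auc = scores(tp=60, fn=40, tn=900, fp=100)
# tpr = 0.6, tnr = 0.9
```

Note how Gmean punishes a degenerate majority-class classifier: with TPR near 0, Gmean collapses to near 0 even if accuracy is high.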
Table 1: The results of the experiment obtained on the Kaggle Give Me Some Credit data. For each classifier (Log, J48, CART, KNN, NB, Bag, AdaB, RF, LogitB, MLP) the table reports TPR, TNR, Gmean, and AUC under three settings: None, SMOTE, and RBM-SMOTE.

Results. The results are presented in Table 1. It can be noticed that when no oversampling method is applied the values of TPR are close to 0. Comparing our approach with SMOTE, we observe that RBM-SMOTE outperforms plain SMOTE for all classification methods (see Gmean and AUC in Table 1). The differences are especially visible for the comprehensible models (J48, CART, RF). It is important to highlight that our solution is noticeably better at detecting positive (minority) examples (see the TPR values in Table 1) for most of the classifiers considered in the experimental studies. This matters greatly in the considered credit scoring problem, where the minority class corresponds to the group of consumers who are unable to repay their financial liabilities.

4 Conclusion and future work

In this paper, we presented a novel oversampling technique that uses an RBM to adjust the examples created with SMOTE toward the true distribution over binary features. As a consequence, the artificial examples are expected to be approximately drawn from the true distribution. The results of the preliminary experiments performed on the selected dataset are promising and motivate a more thorough analysis of the proposed solution. For future work we plan to evaluate the quality of the proposed solution on a large number of datasets from various domains. We would also like to extend the presented approach to numerical features by assuming that the visible units are modeled with Gaussian distributions. Additionally, we plan to compare the results obtained by RBM-SMOTE with other SMOTE-based solutions (e.g., [16]).
Acknowledgments

The research conducted by the authors has been partially co-financed by the Ministry of Science and Higher Education, Republic of Poland, namely, Maciej Zięba: grant No. B40242/I32, Jakub M. Tomczak: grant No. B40020/I32, Adam Gonczarek: grant No. B40235/I32. The work conducted by Maciej Zięba is also co-financed by the European Union within the European Social Fund.

References

1. Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Advances in Knowledge Discovery and Data Mining. Springer (2009)
2. Chawla, N.V., Bowyer, K.W., Hall, L.O.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002)
3. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.: SMOTEBoost: Improving prediction of the minority class in boosting. In: Proceedings of the Principles of Knowledge Discovery in Databases, PKDD. Springer (2003)
4. Chen, S., He, H., Garcia, E.: RAMOBoost: Ranked minority oversampling in boosting. IEEE Transactions on Neural Networks 21(10) (2010)
5. Ertekin, S., Huang, J., Giles, C.: Active learning for class imbalance problem. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM (2007)
6. García, S., Fernández, A., Herrera, F.: Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Applied Soft Computing 9(4) (2009)
7. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9) (Sep 2009)
8. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2(5-6) (2009)
9. Hinton, G.: A practical guide to training restricted Boltzmann machines. Momentum 9(1) (2010)
10. Hinton, G.E.: A practical guide to training restricted Boltzmann machines. In: Neural Networks: Tricks of the Trade. Springer (2012)
11. Han, H., Wang, W., Mao, B.: Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing (2005)
12. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann machines. In: Proceedings of the 25th International Conference on Machine Learning. ACM (2008)
13. Maciejewski, T., Stefanowski, J.: Local neighbourhood extension of SMOTE for mining imbalanced data. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE (2011)
14. Mani, I., Zhang, I.: kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML Workshop on Learning from Imbalanced Data Sets (2003)
15. Ramentol, E., Caballero, Y., Bello, R., Herrera, F.: SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and Information Systems 33(2) (2012)
16. Sáez, J.A., Luengo, J., Stefanowski, J., Herrera, F.: SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences (2014)
17. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning. ACM (2007)
18. Seiffert, C., Khoshgoftaar, T., Van Hulse, J., Napolitano, A.: RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 40(1) (2010)
19. Tang, Y., Zhang, Y., Huang, Z.: Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 4(3) (2007)
20. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7) (2006)
21. Tomek, I.: Two modifications of CNN. IEEE Transactions on Systems, Man and Cybernetics 6(11) (Nov 1976)
22. Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. In: IEEE Symposium on Computational Intelligence and Data Mining. IEEE (2009)
More informationThe exam is closed book, closed notes except your one-page (two-sided) cheat sheet.
CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or
More informationA faster model selection criterion for OP-ELM and OP-KNN: Hannan-Quinn criterion
A faster model selection criterion for OP-ELM and OP-KNN: Hannan-Quinn criterion Yoan Miche 1,2 and Amaury Lendasse 1 1- Helsinki University of Technology - ICS Lab. Konemiehentie 2, 02015 TKK - Finland
More informationA Survey on Postive and Unlabelled Learning
A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled
More informationSupervised Learning Classification Algorithms Comparison
Supervised Learning Classification Algorithms Comparison Aditya Singh Rathore B.Tech, J.K. Lakshmipat University -------------------------------------------------------------***---------------------------------------------------------
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining
More informationFacial Expression Classification with Random Filters Feature Extraction
Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationFeature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process
Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process KITTISAK KERDPRASOP and NITTAYA KERDPRASOP Data Engineering Research Unit, School of Computer Engineering, Suranaree
More informationLink Prediction for Social Network
Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue
More informationRacing for unbalanced methods selection
The International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2013) Racing for unbalanced methods selection Andrea DAL POZZOLO, Olivier CAELEN, Serge WATERSCHOOT and Gianluca
More informationTraining Restricted Boltzmann Machines using Approximations to the Likelihood Gradient. Ali Mirzapour Paper Presentation - Deep Learning March 7 th
Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient Ali Mirzapour Paper Presentation - Deep Learning March 7 th 1 Outline of the Presentation Restricted Boltzmann Machine
More informationCHALLENGES IN HANDLING IMBALANCED BIG DATA: A SURVEY
CHALLENGES IN HANDLING IMBALANCED BIG DATA: A SURVEY B.S.Mounika Yadav 1, Sesha Bhargavi Velagaleti 2 1 Asst. Professor, IT Dept., Vasavi College of Engineering 2 Asst. Professor, IT Dept., G.Narayanamma
More informationA Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence
2nd International Conference on Electronics, Network and Computer Engineering (ICENCE 206) A Network Intrusion Detection System Architecture Based on Snort and Computational Intelligence Tao Liu, a, Da
More informationK-Neighbor Over-Sampling with Cleaning Data: A New Approach to Improve Classification. Performance in Data Sets with Class Imbalance
Applied Mathematical Sciences, Vol. 12, 2018, no. 10, 449-460 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ams.2018.8231 K-Neighbor ver-sampling with Cleaning Data: A New Approach to Improve Classification
More informationTo be Bernoulli or to be Gaussian, for a Restricted Boltzmann Machine
2014 22nd International Conference on Pattern Recognition To be Bernoulli or to be Gaussian, for a Restricted Boltzmann Machine Takayoshi Yamashita, Masayuki Tanaka, Eiji Yoshida, Yuji Yamauchi and Hironobu
More informationReddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011
Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions
More informationComparison of different preprocessing techniques and feature selection algorithms in cancer datasets
Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Konstantinos Sechidis School of Computer Science University of Manchester sechidik@cs.man.ac.uk Abstract
More informationPredict Employees Computer Access Needs in Company
Predict Employees Computer Access Needs in Company Xin Zhou & Wenqi Xiang Email: xzhou15,wenqi@stanford.edu 1.Department of Computer Science 2.Department of Electrical Engineering 1 Abstract When an employee
More informationMore Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA
More Learning Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA 1 Ensembles An ensemble is a set of classifiers whose combined results give the final decision. test feature vector
More informationSupport Vector Machine with Restarting Genetic Algorithm for Classifying Imbalanced Data
Support Vector Machine with Restarting Genetic Algorithm for Classifying Imbalanced Data Keerachart Suksut, Kittisak Kerdprasop, and Nittaya Kerdprasop Abstract Algorithms for data classification are normally
More informationAkarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction
Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different
More informationPathological Lymph Node Classification
Pathological Lymph Node Classification Jonathan Booher, Michael Mariscal and Ashwini Ramamoorthy SUNet ID: { jaustinb, mgm248, ashwinir } @stanford.edu Abstract Machine learning algorithms have the potential
More informationSTUDYING OF CLASSIFYING CHINESE SMS MESSAGES
STUDYING OF CLASSIFYING CHINESE SMS MESSAGES BASED ON BAYESIAN CLASSIFICATION 1 LI FENG, 2 LI JIGANG 1,2 Computer Science Department, DongHua University, Shanghai, China E-mail: 1 Lifeng@dhu.edu.cn, 2
More informationNPC: Neighbors Progressive Competition Algorithm for Classification of Imbalanced Data Sets
NPC: Neighbors Progressive Competition Algorithm for Classification of Imbalanced Data Sets Soroush Saryazdi 1, Bahareh Nikpour 2, Hossein Nezamabadi-pour 3 Department of Electrical Engineering, Shahid
More informationLearning Class-relevant Features and Class-irrelevant Features via a Hybrid third-order RBM
via a Hybrid third-order RBM Heng Luo Ruimin Shen Changyong Niu Carsten Ullrich Shanghai Jiao Tong University hengluo@sjtu.edu Shanghai Jiao Tong University rmshen@sjtu.edu Zhengzhou University iecyniu@zzu.edu.cn
More informationMachine Learning. Chao Lan
Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian
More informationThe exam is closed book, closed notes except your one-page cheat sheet.
CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right
More informationBoosting Support Vector Machines for Imbalanced Data Sets
Boosting Support Vector Machines for Imbalanced Data Sets Benjamin X. Wang and Nathalie Japkowicz School of information Technology and Engineering, University of Ottawa, 800 King Edward Ave., P.O.Box 450
More informationA Constrained Spreading Activation Approach to Collaborative Filtering
A Constrained Spreading Activation Approach to Collaborative Filtering Josephine Griffith 1, Colm O Riordan 1, and Humphrey Sorensen 2 1 Dept. of Information Technology, National University of Ireland,
More informationSemi-Supervised Clustering with Partial Background Information
Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
More informationWeka ( )
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised
More informationAnalysing the Multi-class Imbalanced Datasets using Boosting Methods and Relevant Information
I J C T A, 10(9), 2017, pp. 933-947 International Science Press ISSN: 0974-5572 Analysing the Multi-class Imbalanced Datasets using Boosting Methods and Relevant Information Neelam Rout*, Debahuti Mishra**
More informationA Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis
A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract
More informationRacing for Unbalanced Methods Selection
Racing for Unbalanced Methods Selection Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi Abstract State-of-the-art classification algorithms suffer when the data is skewed towards one class. This
More informationLabel Distribution Learning. Wei Han
Label Distribution Learning Wei Han, Big Data Research Center, UESTC Email:wei.hb.han@gmail.com Outline 1. Why label distribution learning? 2. What is label distribution learning? 2.1. Problem Formulation
More informationAn Analysis of the Rule Weights and Fuzzy Reasoning Methods for Linguistic Rule Based Classification Systems Applied to Problems with Highly Imbalanced Data Sets Alberto Fernández 1, Salvador García 1,
More informationReview on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationContent Based Image Retrieval system with a combination of Rough Set and Support Vector Machine
Shahabi Lotfabadi, M., Shiratuddin, M.F. and Wong, K.W. (2013) Content Based Image Retrieval system with a combination of rough set and support vector machine. In: 9th Annual International Joint Conferences
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More informationUsing Decision Boundary to Analyze Classifiers
Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision
More informationImproving Imputation Accuracy in Ordinal Data Using Classification
Improving Imputation Accuracy in Ordinal Data Using Classification Shafiq Alam 1, Gillian Dobbie, and XiaoBin Sun 1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand shafiq.alam@whitireia.ac.nz
More informationMachine Learning Lecture 3
Machine Learning Lecture 3 Probability Density Estimation II 19.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Exam dates We re in the process
More informationData Mining Classification: Alternative Techniques. Imbalanced Class Problem
Data Mining Classification: Alternative Techniques Imbalanced Class Problem Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Class Imbalance Problem Lots of classification problems
More informationOn dynamic ensemble selection and data preprocessing for multi-class imbalance learning
On dynamic ensemble selection and data preprocessing for multi-class imbalance learning Rafael M. O. Cruz Ecole de Technologie Supérieure Montreal, Canada rafaelmenelau@gmail.com Robert Sabourin Ecole
More information2. On classification and related tasks
2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.
More informationContents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation
Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4
More informationOn Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions
On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions CAMCOS Report Day December 9th, 2015 San Jose State University Project Theme: Classification The Kaggle Competition
More informationEnergy Based Models, Restricted Boltzmann Machines and Deep Networks. Jesse Eickholt
Energy Based Models, Restricted Boltzmann Machines and Deep Networks Jesse Eickholt ???? Who s heard of Energy Based Models (EBMs) Restricted Boltzmann Machines (RBMs) Deep Belief Networks Auto-encoders
More informationBioinformatics - Lecture 07
Bioinformatics - Lecture 07 Bioinformatics Clusters and networks Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Learning on profiles
More informationAvailable online at ScienceDirect. Procedia Computer Science 35 (2014 )
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 35 (2014 ) 388 396 18 th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems
More informationarxiv: v4 [cs.lg] 17 Sep 2018
Meta-Learning for Resampling Recommendation Systems * Dmitry Smolyakov 1, Alexander Korotin 1, Pavel Erofeev 2, Artem Papanov 2, Evgeny Burnaev 1 1 Skolkovo Institute of Science and Technology Nobel street,
More informationRandom Forest A. Fornaser
Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University
More informationVisual object classification by sparse convolutional neural networks
Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.
More information