Package 'ebmc'

August 29, 2017

Type Package
Title Ensemble-Based Methods for Class Imbalance Problem
Version 1.0.0
Author Hsiang Hao, Chen
Maintainer Hsiang Hao, Chen <kbman1101@gmail.com>
Description Four ensemble-based methods (SMOTEBoost, RUSBoost, UnderBagging, and SMOTEBagging) for the class imbalance problem are implemented for binary classification. These methods combine ensemble learning with data re-sampling techniques to improve model performance in the presence of class imbalance. A special feature is the possibility to choose among multiple supervised learning algorithms for building the weak learners within the ensemble models. References: Nitesh V. Chawla, Aleksandar Lazarevic, Lawrence O. Hall, and Kevin W. Bowyer (2003) <doi:10.1007/978-3-540-39804-2_12>; Chris Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano (2010) <doi:10.1109/TSMCA.2009.2029559>; R. Barandela, J. S. Sanchez, R. M. Valdovinos (2003) <doi:10.1007/s10044-003-0192-z>; Shuo Wang and Xin Yao (2009) <doi:10.1109/CIDM.2009.4938667>; Yoav Freund and Robert E. Schapire (1997) <doi:10.1006/jcss.1997.1504>.
Depends methods
Imports e1071, rpart, C50, randomForest, DMwR, pROC
License GPL (>= 3)
Encoding UTF-8
LazyData true
NeedsCompilation no
Repository CRAN
Date/Publication 2017-08-29 15:44:12 UTC

R topics documented:
adam2
measure
predict.modelbag
predict.modelbst
rus
sbag
sbo
ub
Index

adam2                    Implementation of AdaBoost.M2

Description

The function implements AdaBoost.M2 for binary classification. It returns a list of weak learners that are built on random under-sampled training sets, and a vector of error estimations for each weak learner. The weak learners together constitute the ensemble model.

Usage

adam2(formula, data, size, alg, rf.ntree = 50, svm.ker = "radial")

Arguments

formula    A formula specifying the predictors and the target variable. The target variable should be a factor of 0 and 1. Predictors can be either numerical or categorical.
data       A data frame used for training the model, i.e. the training set.
size       Ensemble size, i.e. the number of weak learners in the ensemble model.
alg        The learning algorithm used to train weak learners in the ensemble model. cart, c50, rf, nb, and svm are available. Please see Details for more information.
rf.ntree   Number of decision trees in each forest of the ensemble model when using rf (Random Forest) as the base learner. An integer is required.
svm.ker    Kernel function used when svm is the base algorithm. Four options are available: linear, polynomial, radial, and sigmoid. Default is radial. Equivalent to the corresponding argument in e1071::svm().

Details

AdaBoost.M2 is an extension of AdaBoost. It introduces a pseudo-loss, which is a more sophisticated way to estimate error and update instance weights in each iteration compared to AdaBoost and AdaBoost.M1. Although AdaBoost.M2 was originally implemented with decision trees, this function makes it possible to use other learning algorithms for building the weak learners.

Argument alg specifies the learning algorithm used to train weak learners within the ensemble model. In total, five algorithms are implemented: cart (Classification and Regression Tree), c50 (C5.0 Decision Tree), rf (Random Forest), nb (Naive Bayes), and svm (Support Vector Machine). When Random Forest is used as the base learner, the ensemble model consists of forests, and each forest contains a number of trees.

The function requires the target variable to be a factor of 0 and 1, where 1 indicates minority instances and 0 indicates majority instances. Only binary classification is implemented in this version.

The returned list has class modelbst, so it can be passed directly to predict() for predicting test instances.

Value

The function returns a list containing two elements:

weaklearners       A list of weak learners.
errorestimation    Error estimation of each weak learner, calculated as (pseudo_loss + smooth) / (1 - pseudo_loss + smooth). smooth helps prevent an error rate of 0 resulting from perfect classification during training iterations. For more information, please see Schapire et al. (1999), Section 4.2.

References

Freund, Y. and Schapire, R. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences. 55, pp. 119-139.

Freund, Y. and Schapire, R. 1996. Experiments with a New Boosting Algorithm. In Proceedings of the 13th International Conference on Machine Learning. pp. 148-156.

Schapire, R. and Singer, Y. 1999. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning. 37(3), pp. 297-336.

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. 2012. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). 42(4), pp. 463-484.

Examples

model1 <- adam2(Species ~ ., data = iris, size = 10, alg = "c50")
model2 <- adam2(Species ~ ., data = iris, size = 20, alg = "rf", rf.ntree = 100)
model3 <- adam2(Species ~ ., data = iris, size = 40, alg = "svm", svm.ker = "sigmoid")

measure                    Calculating Performance Measurements in Class Imbalance Problems

Description

The function is an integration of multiple performance measurements that can be used to assess model performance in class imbalance problems. In total, six measurements are included.

Usage

measure(label, probability, metric, threshold = 0.5)

Arguments

label          A vector of the actual labels of the target variable in the test set.
probability    A vector of probabilities estimated by the model.
metric         Measurement used for assessing model performance. auc, gmean, tpr, tnr, f, and acc are available. Please see Details for more information.
threshold      Probability threshold for determining the class of instances. A numerical value ranging from 0 to 1. Default is 0.5.

Details

This function integrates six common measurements. It uses pROC::roc() and pROC::auc() to calculate auc (Area Under the Curve), while the other measurements are calculated without depending on other packages: gmean (Geometric Mean), tpr (True Positive Rate), tnr (True Negative Rate), and f (F-Measure). acc (Accuracy) is also included for any possible use, although accuracy can be misleading when the classes of the test set are highly imbalanced.

threshold is the probability cutoff for determining the predicted class of instances. For auc, users do not need to specify threshold because AUC is not affected by the probability cutoff. However, the threshold is required for the other five measurements.

Examples

# Create training and test sets
samp <- sample(nrow(iris), nrow(iris) * 0.7)
train <- iris[samp, ]
test <- iris[-samp, ]

# Model building and prediction
model <- rus(Species ~ ., data = train, size = 10, alg = "c50")
prob <- predict(model, newdata = test, type = "prob")

# Calculate measurements
auc <- measure(label = test$Species, probability = prob, metric = "auc")
gmean <- measure(label = test$Species, probability = prob, metric = "gmean", threshold = 0.5)
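The remaining measurements follow the same calling pattern. The short sketch below is illustrative only (not part of the package examples); it continues with the test and prob objects from the example above and assumes the target is the required 0/1 factor.

# Remaining threshold-dependent measurements, using the default cutoff of 0.5
tpr <- measure(label = test$Species, probability = prob, metric = "tpr", threshold = 0.5)
tnr <- measure(label = test$Species, probability = prob, metric = "tnr", threshold = 0.5)
f   <- measure(label = test$Species, probability = prob, metric = "f",   threshold = 0.5)
acc <- measure(label = test$Species, probability = prob, metric = "acc", threshold = 0.5)

# A stricter cutoff changes these measurements; auc would stay the same
tpr_strict <- measure(label = test$Species, probability = prob, metric = "tpr", threshold = 0.7)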

predict.modelbag                    Predict Method for modelbag Objects

Description

Predicts instances in a test set using a modelbag object.

Usage

## S3 method for class 'modelbag'
predict(object, newdata, type = "prob", ...)

Arguments

object     An object of class modelbag.
newdata    A data frame containing the new instances.
type       Type of output, which can be prob (probability) or class (predicted label). Default is prob.
...        Not used currently.

Value

Two types of output can be selected:

prob     Estimated probability of being a minority instance (i.e. 1). The probability is averaged using an equal-weight majority vote over all weak learners.
class    Predicted class of the instance. Instances with probability larger than 0.5 are predicted as 1, otherwise 0.

Examples

samp <- sample(nrow(iris), nrow(iris) * 0.7)
train <- iris[samp, ]
test <- iris[-samp, ]

model <- ub(Species ~ ., data = train, size = 10, alg = "c50")   # Build UnderBagging model
prob <- predict(model, newdata = test, type = "prob")    # return probability estimation
pred <- predict(model, newdata = test, type = "class")   # return predicted class
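As an illustrative follow-up (not part of the package examples, and assuming the target in test is the required 0/1 factor), the predicted labels can be cross-tabulated against the actual labels with base R, and the probability output can be scored with measure():

# Confusion table of predicted vs. actual labels
table(predicted = pred, actual = test$Species)

# AUC of the probability estimates
measure(label = test$Species, probability = prob, metric = "auc")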

predict.modelbst                    Predict Method for modelbst Objects

Description

Predicts instances in a test set using a modelbst object.

Usage

## S3 method for class 'modelbst'
predict(object, newdata, type = "prob", ...)

Arguments

object     An object of class modelbst.
newdata    A data frame containing the new instances.
type       Type of output, which can be prob (probability) or class (predicted label). Default is prob.
...        Not used currently.

Value

Two types of output can be selected:

prob     Estimated probability of being a minority instance (i.e. 1). The probability is averaged using a majority vote over all weak learners, weighted by their error estimations.
class    Predicted class of the instance. Instances with probability larger than 0.5 are predicted as 1, otherwise 0.

Examples

samp <- sample(nrow(iris), nrow(iris) * 0.7)
train <- iris[samp, ]
test <- iris[-samp, ]

model <- rus(Species ~ ., data = train, size = 10, alg = "c50")   # Build RUSBoost model
prob <- predict(model, newdata = test, type = "prob")    # return probability estimation
pred <- predict(model, newdata = test, type = "class")   # return predicted class
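The class output simply applies the documented 0.5 cutoff to the probability output. A small sketch (illustrative only, continuing the objects from the example above) applies the same rule by hand and compares the two:

# The documented decision rule applied manually
pred_manual <- ifelse(prob > 0.5, 1, 0)

# The two results should follow the same rule
table(manual = pred_manual, from_predict = pred)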

rus                    Implementation of RUSBoost

Description

The function implements RUSBoost for binary classification. It returns a list of weak learners that are built on random under-sampled training sets, and a vector of error estimations for each weak learner. The weak learners together constitute the ensemble model.

Usage

rus(formula, data, size, alg, ir = 1, rf.ntree = 50, svm.ker = "radial")

Arguments

formula    A formula specifying the predictors and the target variable. The target variable should be a factor of 0 and 1. Predictors can be either numerical or categorical.
data       A data frame used for training the model, i.e. the training set.
size       Ensemble size, i.e. the number of weak learners in the ensemble model.
alg        The learning algorithm used to train weak learners in the ensemble model. cart, c50, rf, nb, and svm are available. Please see Details for more information.
ir         Imbalance ratio, specifying how many times the under-sampled majority instances should outnumber the minority instances. An integer is not required, so a value such as ir = 1.5 is allowed.
rf.ntree   Number of decision trees in each forest of the ensemble model when using rf (Random Forest) as the base learner. An integer is required.
svm.ker    Kernel function used when svm is the base algorithm. Four options are available: linear, polynomial, radial, and sigmoid. Default is radial. Equivalent to the corresponding argument in e1071::svm().

Details

Based on AdaBoost.M2, RUSBoost uses random under-sampling to reduce the majority instances in each iteration of training weak learners. A 1:1 under-sampling ratio (i.e. equal numbers of majority and minority instances) is set as the default.

The function requires the target variable to be a factor of 0 and 1, where 1 indicates minority instances and 0 indicates majority instances. Only binary classification is implemented in this version.

Argument alg specifies the learning algorithm used to train weak learners within the ensemble model. In total, five algorithms are implemented: cart (Classification and Regression Tree), c50 (C5.0 Decision Tree), rf (Random Forest), nb (Naive Bayes), and svm (Support Vector Machine). When Random Forest is used as the base learner, the ensemble model consists of forests, and each forest contains a number of trees.

ir refers to the intended imbalance ratio of the training sets after manipulation. With ir = 1 (default), the numbers of majority and minority instances are equal after class rebalancing. With ir = 2, the number of majority instances is twice that of the minority instances. An integer is not required, so a value such as ir = 1.5 is allowed.

The returned list has class modelbst, so it can be passed directly to predict() for predicting test instances.

Value

The function returns a list containing two elements:

weaklearners       A list of weak learners.
errorestimation    Error estimation of each weak learner, calculated as (pseudo_loss + smooth) / (1 - pseudo_loss + smooth). smooth helps prevent an error rate of 0 resulting from perfect classification during training iterations. For more information, please see Schapire et al. (1999), Section 4.2.

References

Seiffert, C., Khoshgoftaar, T., Hulse, J., and Napolitano, A. 2010. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans. 40(1), pp. 185-197.

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. 2012. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). 42(4), pp. 463-484.

Freund, Y. and Schapire, R. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences. 55, pp. 119-139.

Freund, Y. and Schapire, R. 1996. Experiments with a New Boosting Algorithm. In Proceedings of the 13th International Conference on Machine Learning. pp. 148-156.

Schapire, R. and Singer, Y. 1999. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning. 37(3), pp. 297-336.

Examples

model1 <- rus(Species ~ ., data = iris, size = 10, alg = "c50", ir = 1)
model2 <- rus(Species ~ ., data = iris, size = 20, alg = "rf", ir = 1, rf.ntree = 100)
model3 <- rus(Species ~ ., data = iris, size = 40, alg = "svm", ir = 1, svm.ker = "sigmoid")
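Illustrative calls varying ir (a sketch, not part of the package examples; as above, the target in data is assumed to be the required 0/1 factor):

# ir = 1.5 keeps roughly 1.5 under-sampled majority instances per minority instance
model_ir15 <- rus(Species ~ ., data = iris, size = 10, alg = "c50", ir = 1.5)

# ir = 2 keeps about twice as many majority instances as minority instances
model_ir2 <- rus(Species ~ ., data = iris, size = 10, alg = "c50", ir = 2)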

sbag                    Implementation of SMOTEBagging

Description

The function implements SMOTEBagging for binary classification. It returns a list of weak learners that are built on training sets manipulated by SMOTE and random over-sampling. Together they constitute the ensemble model.

Usage

sbag(formula, data, size, alg, rf.ntree = 50, svm.ker = "radial")

Arguments

formula    A formula specifying the predictors and the target variable. The target variable should be a factor of 0 and 1. Predictors can be either numerical or categorical.
data       A data frame used for training the model, i.e. the training set.
size       Ensemble size, i.e. the number of weak learners in the ensemble model.
alg        The learning algorithm used to train weak learners in the ensemble model. cart, c50, rf, nb, and svm are available. Please see Details for more information.
rf.ntree   Number of decision trees in each forest of the ensemble model when using rf (Random Forest) as the base learner. An integer is required.
svm.ker    Kernel function used when svm is the base algorithm. Four options are available: linear, polynomial, radial, and sigmoid. Default is radial. Equivalent to the corresponding argument in e1071::svm().

Details

SMOTEBagging uses both SMOTE (Synthetic Minority Over-sampling TEchnique) and random over-sampling to increase the minority instances in each bag of Bagging in order to rebalance the class distribution. The manipulated training sets contain equal numbers of majority and minority instances, but the proportions of minority instances generated by SMOTE and by random over-sampling vary across bags, determined by an assigned re-sampling rate a. The re-sampling rate a is always a multiple of 10, and the function automatically generates the vector of a values, so users do not need to define it themselves.

The function requires the target variable to be a factor of 0 and 1, where 1 indicates minority instances and 0 indicates majority instances. Only binary classification is implemented in this version.

Argument alg specifies the learning algorithm used to train weak learners within the ensemble model. In total, five algorithms are implemented: cart (Classification and Regression Tree), c50 (C5.0 Decision Tree), rf (Random Forest), nb (Naive Bayes), and svm (Support Vector Machine). When Random Forest is used as the base learner, the ensemble model consists of forests, and each forest contains a number of trees.

The returned list has class modelbag, so it can be passed directly to predict() for predicting test instances.

References

Wang, S. and Yao, X. 2009. Diversity Analysis on Imbalanced Data Sets by Using Ensemble Models. IEEE Symposium on Computational Intelligence and Data Mining, CIDM '09.

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. 2012. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). 42(4), pp. 463-484.

Examples

model1 <- sbag(Species ~ ., data = iris, size = 10, alg = "c50")
model2 <- sbag(Species ~ ., data = iris, size = 20, alg = "rf", rf.ntree = 100)
model3 <- sbag(Species ~ ., data = iris, size = 40, alg = "svm", svm.ker = "sigmoid")

sbo                    Implementation of SMOTEBoost

Description

The function implements SMOTEBoost for binary classification. It returns a list of weak learners that are built on SMOTE-manipulated training sets, and a vector of error estimations for each weak learner. The weak learners together constitute the ensemble model.

Usage

sbo(formula, data, size, alg, over = 100, rf.ntree = 50, svm.ker = "radial")

Arguments

formula    A formula specifying the predictors and the target variable. The target variable should be a factor of 0 and 1. Predictors can be either numerical or categorical.
data       A data frame used for training the model, i.e. the training set.
size       Ensemble size, i.e. the number of weak learners in the ensemble model.
alg        The learning algorithm used to train weak learners in the ensemble model. cart, c50, rf, nb, and svm are available. Please see Details for more information.
over       Over-sampling rate of SMOTE. Only multiples of 100 are acceptable.
rf.ntree   Number of decision trees in each forest of the ensemble model when using rf (Random Forest) as the base learner. An integer is required.
svm.ker    Kernel function used when svm is the base algorithm. Four options are available: linear, polynomial, radial, and sigmoid. Default is radial. Equivalent to the corresponding argument in e1071::svm().

Details

Based on AdaBoost.M2, SMOTEBoost uses SMOTE (Synthetic Minority Over-sampling TEchnique) to increase the minority instances in each iteration of training weak learners. The over-sampling rate of SMOTE can be defined by users with the argument over.

The function requires the target variable to be a factor of 0 and 1, where 1 indicates minority instances and 0 indicates majority instances. Only binary classification is implemented in this version.

Argument alg specifies the learning algorithm used to train weak learners within the ensemble model. In total, five algorithms are implemented: cart (Classification and Regression Tree), c50 (C5.0 Decision Tree), rf (Random Forest), nb (Naive Bayes), and svm (Support Vector Machine). When Random Forest is used as the base learner, the ensemble model consists of forests, and each forest contains a number of trees.

The returned list has class modelbst, so it can be passed directly to predict() for predicting test instances.

Value

The function returns a list containing two elements:

weaklearners       A list of weak learners.
errorestimation    Error estimation of each weak learner, calculated as (pseudo_loss + smooth) / (1 - pseudo_loss + smooth). smooth helps prevent an error rate of 0 resulting from perfect classification during training iterations. For more information, please see Schapire et al. (1999), Section 4.2.

References

Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. 2003. SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery. pp. 107-119.

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. 2012. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). 42(4), pp. 463-484.

Freund, Y. and Schapire, R. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences. 55, pp. 119-139.

Freund, Y. and Schapire, R. 1996. Experiments with a New Boosting Algorithm. In Proceedings of the 13th International Conference on Machine Learning. pp. 148-156.

Schapire, R. and Singer, Y. 1999. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning. 37(3), pp. 297-336.

Examples

model1 <- sbo(Species ~ ., data = iris, size = 10, over = 100, alg = "c50")
model2 <- sbo(Species ~ ., data = iris, size = 20, over = 200, alg = "rf", rf.ntree = 100)
model3 <- sbo(Species ~ ., data = iris, size = 40, over = 300, alg = "svm", svm.ker = "sigmoid")

ub                    Implementation of UnderBagging

Description

The function implements UnderBagging for binary classification. It returns a list of weak learners that are built on random under-sampled training sets. Together they constitute the ensemble model.

Usage

ub(formula, data, size, alg, ir = 1, rf.ntree = 50, svm.ker = "radial")

Arguments

formula    A formula specifying the predictors and the target variable. The target variable should be a factor of 0 and 1. Predictors can be either numerical or categorical.
data       A data frame used for training the model, i.e. the training set.
size       Ensemble size, i.e. the number of weak learners in the ensemble model.
alg        The learning algorithm used to train weak learners in the ensemble model. cart, c50, rf, nb, and svm are available. Please see Details for more information.
ir         Imbalance ratio, specifying how many times the under-sampled majority instances should outnumber the minority instances. An integer is not required, so a value such as ir = 1.5 is allowed.
rf.ntree   Number of decision trees in each forest of the ensemble model when using rf (Random Forest) as the base learner. An integer is required.
svm.ker    Kernel function used when svm is the base algorithm. Four options are available: linear, polynomial, radial, and sigmoid. Default is radial. Equivalent to the corresponding argument in e1071::svm().

Details

UnderBagging uses random under-sampling to reduce the majority instances in each bag of Bagging in order to rebalance the class distribution. A 1:1 under-sampling ratio (i.e. equal numbers of majority and minority instances) is set as the default.

The function requires the target variable to be a factor of 0 and 1, where 1 indicates minority instances and 0 indicates majority instances. Only binary classification is implemented in this version.

Argument alg specifies the learning algorithm used to train weak learners within the ensemble model. In total, five algorithms are implemented: cart (Classification and Regression Tree), c50 (C5.0 Decision Tree), rf (Random Forest), nb (Naive Bayes), and svm (Support Vector Machine). When Random Forest is used as the base learner, the ensemble model consists of forests, and each forest contains a number of trees.

ir refers to the intended imbalance ratio of the training sets after manipulation. With ir = 1 (default), the numbers of majority and minority instances are equal after class rebalancing. With ir = 2, the number of majority instances is twice that of the minority instances. An integer is not required, so a value such as ir = 1.5 is allowed.

The returned list has class modelbag, so it can be passed directly to predict() for predicting test instances.

References

Barandela, R., Sanchez, J., and Valdovinos, R. 2003. New Applications of Ensembles of Classifiers. Pattern Analysis and Applications. 6(3), pp. 245-256.

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. 2012. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). 42(4), pp. 463-484.

Examples

model1 <- ub(Species ~ ., data = iris, size = 10, alg = "c50", ir = 1)
model2 <- ub(Species ~ ., data = iris, size = 20, alg = "rf", ir = 1, rf.ntree = 100)
model3 <- ub(Species ~ ., data = iris, size = 40, alg = "svm", ir = 1, svm.ker = "sigmoid")
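The examples above pass iris directly, but iris has three species while every function in this package requires a two-level factor target coded as 0 and 1 (1 = minority, 0 = majority). The following end-to-end sketch is illustrative only and not part of the package examples: it builds a binary subset of iris, recodes one class as 1, fits an UnderBagging model, and scores it with measure(). In this toy subset the two classes happen to be balanced; a real use case would have a genuinely imbalanced target.

library(ebmc)

# Keep two of the three species and recode them as a 0/1 factor
iris2 <- iris[iris$Species != "virginica", ]
iris2$Species <- factor(ifelse(iris2$Species == "versicolor", 1, 0))

# Train/test split
samp <- sample(nrow(iris2), nrow(iris2) * 0.7)
train <- iris2[samp, ]
test <- iris2[-samp, ]

# Fit UnderBagging, predict probabilities, and compute AUC
model <- ub(Species ~ ., data = train, size = 10, alg = "c50", ir = 1)
prob <- predict(model, newdata = test, type = "prob")
measure(label = test$Species, probability = prob, metric = "auc")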

Index

adam2
measure
predict.modelbag
predict.modelbst
rus
sbag
sbo
ub