A Selective Sampling Method for Imbalanced Data Learning on Support Vector Machines


Iowa State University Digital Repository @ Iowa State University Graduate Theses and Dissertations Graduate College 2010 A Selective Sampling Method for Imbalanced Data Learning on Support Vector Machines Jong Myong Choi Iowa State University Follow this and additional works at: Part of the Industrial Engineering Commons Recommended Citation: Choi, Jong Myong, "A Selective Sampling Method for Imbalanced Data Learning on Support Vector Machines" (2010). Graduate Theses and Dissertations. Paper. This Dissertation is brought to you for free and open access by the Graduate College at Digital Repository @ Iowa State University. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Digital Repository @ Iowa State University. For more information, please contact hinefuku@iastate.edu.

A selective sampling method for imbalanced data learning on support vector machines

by

Jong Myong Choi

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Major: Industrial Engineering

Program of Study Committee:
John K. Jackman, Major Professor
Sigurdur Olafsson
Douglas D. Gemmill
Dianne H. Cook
Anthony M. Townsend

Iowa State University
Ames, Iowa
2010

Copyright Jong Myong Choi, 2010. All rights reserved.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
ABSTRACT
CHAPTER 1 INTRODUCTION
CHAPTER 2 LITERATURE REVIEW
    2.1 Handling the Class Imbalance Problem
        Changing class distributions
        Adjusting classifiers to imbalanced data sets
        Ensemble learning methods
    2.2 Performance Measures for Imbalanced Data Learning
    2.3 Summary and Research Scope
CHAPTER 3 CLASS IMBALANCE PROBLEM WITH SUPPORT VECTOR MACHINE LEARNING
    3.1 Support Vector Machine (SVM) Classifier
    3.2 SVMs and the Skewed Boundary
    3.3 Problems associated with SVM classifier for imbalanced data
    3.4 Effectiveness of rebalancing class distribution
    Hypotheses
CHAPTER 4 SELECTIVE SAMPLING USING A GENETIC ALGORITHM
    SVMs for Large-Scale Datasets
    Genetic Algorithm for Under-sampling of the Majority Class
    Experiments
        Experimental Design
        Approaches for imbalanced data learning
    Experimental Results and Discussion
CHAPTER 5 SMALLER LEARNING SETS FOR IMBALANCED DATA LEARNING WITH SVMs
    A new method to reduce learning time
        Stage 1. Rough elimination of support vectors of the majority class in kernel space
        Stage 2. Selection of majority instance support vectors
    Demonstration of GA-SS
        5.2.1 Linear kernel function case
        5.2.2 Gaussian radial-based kernel function case
    Experiments with real datasets
    Summary and Discussions
CHAPTER 6 CONCLUSIONS
APPENDIX A. EXPERIMENTAL IMBALANCED TRAINING SETS
APPENDIX B. PARAMETER C AND σ SETTING THROUGH 5-FOLD CROSS VALIDATION
APPENDIX C. INITIAL CLASSIFICATION RESULTS ON SVM LEARNING WITH THE ORIGINAL TRAINING AND TEST DATASETS
APPENDIX D. MEAN DIFFERENCE BETWEEN G-MEAN OF THE TRAINING SET IN TERMS OF THREE APPROACHES (SVM-SMOTE, SVM-RU AND GA-IS)
APPENDIX E. MEAN DIFFERENCE BETWEEN G-MEAN OF THE TRAINING SET CORRESPONDING TO ITERATIONS CHOSEN FOR INSTANCE SELECTION
APPENDIX F. SELECTED INSTANCES FOR LEARNING FROM GENETIC ALGORITHM BASED INSTANCE SELECTION APPROACH
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Cost matrix
Table 2.2 Confusion matrix for performance evaluation
Table 4.1 Descriptions of experimental datasets
Table 4.2 G-mean values of the experimental datasets on SVM (training and test sets)
Table 4.3 Average G-mean of training sets obtained from 4 different methods
Table 5.1 Improvement of G-mean and training set reduction after Stage 1
Table 5.2 Average G-mean of test sets in terms of 5 different methods in 20 runs

LIST OF FIGURES

Figure 2.1 Synthetic over-sampling example by SMOTE algorithm
Figure 3.1 Linear separating hyperplanes for the separable case
Figure 3.2 Linear separating hyperplanes for the non-separable case
Figure 3.3 Example of class imbalance problem on SVMs
Figure 3.4 Boundary movements by SMOTE algorithm
Figure 3.5 Boundary movements by random under-sampling
Figure 4.1 GA-based instance selection from the majority instances
Figure 4.2 Class distributions of experimental datasets (abalone and yeast)
Figure 4.3 Average g-mean values of the training dataset in terms of increase of synthetic minority instances by SMOTE
Figure 5.1 Stage 1 algorithm
Figure 5.2 Boundary sensitivity to removing one SV instance
Figure 5.3 GA-SS algorithm
Figure 5.4 Boundary movement through selecting instances in Stage 2
Figure 5.5 Overall procedure of our approach for imbalanced training datasets
Figure 5.6 Decision boundaries at each iteration through selecting instances from SVs of the majority class (linear kernel function)
Figure 5.7 Mapping decision boundaries at Iterations 1 and 2 on the original training set
Figure 5.8 A decision boundary that produces the maximum G-mean for the original training set
Figure 5.9 Decision boundaries for Stage 1 iterations
Figure 5.10 Mapping decision boundaries for Iterations 2 and 3 on the original training dataset
Figure 5.11 Decision boundary that produces the maximum G-mean of the original training set through GA-SS
Figure 5.12 Trend of G-mean values of the original training set on reduction of the majority instances in Stage 1
Figure 5.13 Box plots of G-mean of the training set after Stage 2
Figure 5.14 Comparison of G-mean values for the training sets (average) for 5 different methods
Figure 5.15 Size of training sets
Figure 5.16 Comparison of learning time

ACKNOWLEDGEMENTS

It is a pleasure to thank the many people who helped me conduct research toward a Ph.D. and write this dissertation. This work would not have been finished without their support and patience. First, I would like to heartily express the deepest appreciation to my adviser, Dr. John Jackman, for his encouraging way of helping me whenever I was in trouble during my graduate school years. His persistent guidance enabled me to complete this work. I am also thankful to my committee members, Dr. Sigurdur Olafsson, Dr. Doug Gemmill, Dr. Anthony Townsend and Dr. Dianne Cook, whose valuable and helpful comments improved this work. I would also like to thank my family members: my parents, for educating me with unconditional support to finish my study. Especially, my special thanks go to my wife, In Suk Lee, who always understands and encourages me with her support, patience and endless love throughout my life.

ABSTRACT

The class imbalance problem in classification has been recognized as a significant research problem in recent years, and a number of methods have been introduced to improve classification results. Rebalancing class distributions (such as over-sampling or under-sampling of learning datasets) has been popular due to its ease of implementation and relatively good performance. For the Support Vector Machine (SVM) classification algorithm, research efforts have focused on reducing the size of learning sets because of the algorithm's sensitivity to the size of the dataset. In this dissertation, we propose a metaheuristic approach (Genetic Algorithm) for under-sampling of an imbalanced dataset in the context of a SVM classifier. The goal of this approach is to find an optimal learning set from imbalanced datasets without the empirical studies that are normally required to find an optimal class distribution. Experimental results using real datasets indicate that this metaheuristic under-sampling performed well in rebalancing class distributions. Furthermore, an iterative sampling methodology was used to produce smaller learning sets by removing redundant instances. It incorporates informative and representative under-sampling mechanisms to speed up the learning procedure for imbalanced data learning with a SVM. When compared with existing rebalancing methods and the metaheuristic approach to under-sampling, this iterative methodology not only provides good performance but also enables a SVM classifier to learn using very small learning sets for imbalanced data learning. For large-scale imbalanced datasets, this methodology provides an efficient and effective solution for imbalanced data learning with an SVM.

CHAPTER 1 INTRODUCTION

The imbalanced learning problem in data mining has attracted a significant amount of interest from the research community and practitioners because real-world datasets are frequently imbalanced, having a minority class with relatively few instances when compared to the other classes in the dataset. Standard classification algorithms used in supervised learning have difficulties in correctly classifying the minority class. Most of these algorithms assume a balanced distribution of classes and equal misclassification costs for each class. In addition, these algorithms are designed to generalize from sample data and output the simplest hypothesis that best fits the data. This principle is embedded in the inductive bias of many machine learning algorithms including Decision Tree, nearest neighbor, and Support Vector Machine (SVM). Therefore, when they are used on complex imbalanced data sets, these algorithms are inclined to be overwhelmed by the majority class and ignore the minority class, causing errors in classification for the minority class. In other words, standard classification algorithms try to minimize the overall classification error rate by producing a biased hypothesis which regards almost all instances as the majority class. Recent research on the class imbalance problem has included studies on datasets from a wide variety of contexts such as information retrieval and filtering (Lewis & Catlett, 1994), diagnosis of rare thyroid disease (Murphy & Aha, 1994), text classification (Chawla et al., 2002), credit card fraud detection (Wu & Chang, 2003) and detection of oil spills from satellite images (Kubat et al., 1998). The degree of imbalance varies depending on the context. In intrusion detection, typically less than 10% of the data are actual intrusions. In detection of cancerous cells, less than 1% of cells are actually cancerous.

To illustrate the imbalance problem, consider the Mammography Data Set, which has been used frequently to study the class imbalance learning problem. This data is a collection of images obtained from a series of mammography exams conducted on a set of distinct patients. Analyzing the images in the two classes, cancerous and noncancerous patients, it is observed that the number of noncancerous patients greatly exceeds the number of cancerous patients. Indeed, this data set contains 10,923 negative (majority class) samples and 260 positive (minority class) samples. Ideally, a classifier should classify both classes with almost 100% accuracy. However, classifiers tend to produce a severely biased classification, with the majority class at almost 100% accuracy and, conversely, the minority class at less than 0.5% accuracy. As a result, most cancerous patients are classified as noncancerous (i.e., a false negative). In the classification of diagnosing patients, such a consequence would be extremely costly because treatment would not be initiated. For these imbalanced scenarios, classifiers should provide much higher accuracy for the minority class without a significant loss in accuracy for the majority class. New classification methods are needed to address the class imbalance problem in supervised learning. In this research, we propose a new methodology for imbalanced data learning based on the SVM classification algorithm. The remainder of the dissertation is organized as follows. In Chapter 2, we review general approaches for imbalanced data learning and related studies and describe the scope of this research. Chapter 3 briefly describes the SVM classification algorithm and the causes of imbalanced learning with the SVM classifier. This is followed by a discussion of existing methods that have been used for the class imbalance problem based on the SVM algorithm. At the end of that chapter, the approach used in the new methodology is described. In Chapter 4, we address a critical issue in SVM learning, namely,

large-scale data, and introduce an optimization-based under-sampling method using a Genetic Algorithm; classification results are compared with other sampling methods using real datasets. In Chapter 5, the new methodology that solves the class imbalance problem on SVM with relatively small learning sets for the SVM classifier is described. Finally, we conclude with suggestions on future research directions in Chapter 6.

CHAPTER 2 LITERATURE REVIEW

In general, a class imbalance problem is seen in two situations, namely, natural imbalance or rarity of cases (i.e., instances or samples). Underlying reasons for imbalance could be the lack of occurrences in nature of a specific phenomenon, or possibly insufficient funds or time to collect sufficient data. In recent years, many researchers have studied the class imbalance problem. Weiss (2004) presented an overview of the field of learning from imbalanced datasets. His work particularly focused on the problems with identifying rare objects in data mining by defining two types of rarity: rare classes and rare cases. A rare class contains relatively fewer instances than other classes, while a rare case indicates a small subset of the data (instance) space. Unsupervised learning algorithms such as clustering may help to identify a rare case. More generally, class imbalance is related to rare classes and is associated with classification problems. In his work, Weiss argued that typical evaluation metrics do not adequately describe the value of rarity, so that data mining is not likely to handle rare classes and rare cases. Monard et al. (2002) discussed several issues related to learning with skewed class distributions, such as the relationship between cost-sensitive learning and class distributions, and the limitations of accuracy and error rate in measuring the performance of classifiers.

2.1 Handling the Class Imbalance Problem

The various approaches used to deal with the class imbalance problem can be grouped into three categories: (1) changing class distributions (modifying the data itself to rebalance skewed datasets at the data level), (2) adjustment of classifiers (adjusting standard classification

algorithms to imbalanced data sets by applying a cost or weight for misclassified cases), and (3) ensemble learning methods (using a combination of multiple classifiers with multiple datasets).

Changing class distributions

Changing class distributions is performed at the data level in order to modify the class distribution in the training datasets. Since many more instances belong to the majority class than the minority class, the class distribution can be balanced by under-sampling the majority class, over-sampling the minority class, combining under-sampling and over-sampling, or some other sampling method. Studies have shown that a balanced data set provides improved classification performance as compared with an imbalanced data set. There have been numerous studies on changing class distribution (Laurikkala, 2001 and Estabrooks et al., 2004). Also, Weiss (2003) investigated the effect of class distribution on decision tree classification by changing class distributions to achieve different ratios and measuring performance using accuracy and Area Under the Curve (AUC). Three basic techniques are used in balancing classes, namely, heuristic and non-heuristic under-sampling, heuristic and non-heuristic over-sampling, and advanced sampling. Japkowicz (2000) compared multiple balancing methods and concluded that both under-sampling and over-sampling are very effective methods for dealing with the class imbalance problem.

Over-sampling

One simple over-sampling method is random over-sampling. Its mechanism is adding a set E of additional instances (i.e., instance duplicates) randomly sampled from the minority class to the original set, S. In this way, the number of total instances of the minority class is increased

by |E|, and as a result, the class distribution is more balanced. This provides a mechanism for varying the degree of class distribution balance to any desired level. Over-sampling does not increase information; instead, by replication it raises the weight of the minority samples. The problem with over-sampling is that an over-fitting problem will generally occur, which causes the classification rule to become too specific; even though the accuracy for the training set is high, the classification performance for new test datasets will likely be worse. By appending duplicated data to the original data set, some of the copied data becomes too specific and classifiers will produce multiple clauses for the duplicate data (Kubat and Matwin, 1997). To avoid the over-fitting problem in over-sampling, Chawla et al. (2002) suggested a heuristic over-sampling method, called the Synthetic Minority Over-sampling Technique (SMOTE), which has worked well in various applications. SMOTE is considered to be one of the state-of-the-art approaches for imbalanced learning. This method generates synthetic data based on the feature space similarities between existing minority instances, considering the K-nearest neighbors of each minority instance. In order to create a synthetic instance, it finds the K-nearest neighbors of each minority instance, randomly selects one of them, and then multiplies the corresponding feature vector difference by a random number between 0 and 1 to produce a new minority instance in the neighborhood. Figure 2.1 shows an example of the SMOTE procedure. This synthetic over-sampling avoids the over-fitting problem and also causes the decision boundaries for the minority class to move towards the majority class. As a variant of SMOTE, Han et al. (2005) introduced Borderline-SMOTE, which only over-samples synthetic instances of the minority class near the decision boundary, since those instances are most likely to be misclassified. Results were better when compared to standard SMOTE and random over-sampling using Decision Tree classification. He et al. (2008) introduced a synthetic sampling

method, Adaptive Synthetic Sampling (ADASYN), that uses a density distribution of the minority instances as a criterion to automatically decide the number of synthetic samples generated for each minority instance. ADASYN generates a new instance by calculating the class ratio of the minority and majority instances in the K-nearest neighbors of each minority instance. As a result, more synthetic instances are generated for minority class instances that are harder to learn compared to instances that are easier to learn. This approach improved learning with respect to the data distributions on the imbalanced data sets by reducing the bias of the class distribution and by adaptively shifting the decision boundary to put more attention on instances difficult to learn.

[Figure: a minority instance $x_i$, one of its K-nearest minority neighbors (K = 5) $x_{knn}$, and the generated synthetic instance $x_{syn} = x_i + \delta \cdot (x_{knn} - x_i)$, where $\delta$ is a random number between 0 and 1.]

Figure 2.1 Synthetic over-sampling example by SMOTE algorithm
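To make the generation rule in Figure 2.1 concrete, here is a minimal sketch of SMOTE-style synthetic instance generation (Python/NumPy; the function name smote_sample and its parameters are our own illustration, not code from the dissertation):

    import numpy as np

    def smote_sample(X_min, n_new, k=5, rng=None):
        # X_min: (n, d) array of minority-class instances.
        # Returns n_new synthetic minority instances, SMOTE-style.
        rng = np.random.default_rng(rng)
        n = len(X_min)
        # Squared distances among minority instances only.
        d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)             # exclude self
        knn = np.argsort(d2, axis=1)[:, :k]      # k nearest minority neighbors
        synthetic = np.empty((n_new, X_min.shape[1]))
        for s in range(n_new):
            i = rng.integers(n)                  # a minority instance x_i
            j = rng.choice(knn[i])               # one of its k neighbors x_knn
            delta = rng.random()                 # random number in [0, 1]
            synthetic[s] = X_min[i] + delta * (X_min[j] - X_min[i])
        return synthetic

Each synthetic point lies on the line segment between a minority instance and one of its minority neighbors, which is why the minority region, and hence the decision boundary, expands toward the majority class.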

Under-sampling

While over-sampling adds instances to the original data set, under-sampling removes instances from the majority class while keeping all instances of the minority class, due to the rareness of their information. A simple method for under-sampling the majority class is random under-sampling, a non-heuristic method that balances class distributions by selecting and removing majority instances randomly. Several heuristic under-sampling methods have been proposed from data cleaning in recent years. They are based on either of two different noise model hypotheses: one is that instances near a decision boundary between two classes are considered noise, while the other considers instances having more neighbors from different classes to be noise. Since random under-sampling leads to losing potentially useful data, some heuristic under-sampling methods try to remove superfluous instances which will not affect the classification accuracy of the training set. Hart (1968) introduced a training set condensation algorithm, the Condensed Nearest Neighbor Rule (CNN), in order to find a consistent subset of a sample set which can correctly classify all of the remaining instances in the training set. The algorithm uses two bins, called S and T. Initially, the first sample of the training set is placed in S, while the remaining samples of the training set are placed in T. Then one pass through T is performed. During the scan, whenever a point in T is misclassified using S as the training set, it is transferred from T to S. This process is repeated until no points are transferred from T to S. The motivation for this heuristic is that misclassified data lies close to the decision boundary. In the same manner, Tomek (1976) proposed an effective method to eliminate data in the overlapping regions. Given two instances x and y that have different class labels and are separated by a distance $d(x, y)$, the pair (x, y) is called a Tomek link if there is no instance z such that $d(x, z) < d(x, y)$ or $d(y, z) < d(x, y)$. Instances participating in Tomek links are considered either borderline or noisy.
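Since the Tomek-link condition above reduces to a pair of opposite-class instances that are each other's nearest neighbor, it can be sketched in a few lines (NumPy; the names are ours, a sketch rather than a reference implementation):

    import numpy as np

    def tomek_links(X, y):
        # Return index pairs (i, j) of opposite-class instances that are
        # mutual nearest neighbors, i.e., Tomek links.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)
        nn = d2.argmin(axis=1)                   # nearest neighbor of each instance
        return [(i, j) for i, j in enumerate(nn)
                if nn[j] == i and y[i] != y[j] and i < j]

In under-sampling, only the majority-class member of each link would then be removed.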

Kubat and Matwin (1999) proposed one-sided sampling (OSS) for detecting less relevant instances for learning. This technique is intended to keep all minority instances, since they are rare (even though some of them can be noisy), and instead prune out only majority instances. Initially, it starts with a subset C of the training set S that contains all minority instances, $C \subset S$, and, using a 1-Nearest Neighbor rule with the instances in C, classifies the instances in S. Afterwards, all misclassified instances are moved to C; then all majority instances participating in Tomek links in C are removed, since they are believed to be borderline and/or noisy. Wilson (1972) introduced the Edited Nearest Neighbor Rule (ENN) to remove any instance whose class label differs from the class of at least two of its three nearest neighbors. The idea behind this technique is to remove the instances from the majority class that are near or around the borderline of different classes, based on the concept of nearest neighbors (NN), in order to increase the classification accuracy of minority instances rather than majority instances.

Adjusting classifiers to imbalanced data sets

Rebalancing the data distribution through either over-sampling or under-sampling has had some success, but the methods are usually computationally expensive. Also, changing the class distribution at the data level does not always lead to better classification performance, as a classifier is not always influenced by class distributions. Drummond and Holte (2003) observed that over-sampling did not produce effective improvement in performance, or there was no change in classification. On the contrary, over-sampling prunes less than under-sampling using the default parameters for the C4.5 algorithm. A modification of the parameter settings of C4.5 improved classification performance and avoided the over-fitting problem during over-sampling. Thus, while sampling methods have tried to balance class distribution by considering the proportions of class instances in the original data distribution, other approaches have been

introduced for imbalanced data learning. One is the cost-sensitive method, which uses a cost matrix to penalize the misclassification of instances, as shown in Table 2.1. Typically, no costs are applied to the correctly classified cases, and the cost of misclassifying minority cases is higher than that of majority cases. The objective of this strategy is to minimize the cost of misclassification. In some applications, cost-sensitive techniques have performed better than sampling methods (McCarthy et al., 2005 and Liu et al., 2006).

Table 2.1 Cost matrix

                       Predicted Class i    Predicted Class j
    Actual Class i     0                    c(i, j)
    Actual Class j     c(j, i)              0

MetaCost (Domingos, 1999) is another method related to cost-sensitive learning. It estimates class probabilities using Bagging, then re-labels the training instances with their minimum expected cost classes, and in the end relearns a model using the modified training set. Based on the weight update rule of AdaBoost (Freund & Schapire, 1997) for misclassified instances in iterative learning, Fan et al. (1999) proposed a discriminant weight update method for misclassified instances in imbalanced datasets, called AdaCost. Their approach is to assign larger weights to misclassified instances belonging to the minority class than to those belonging to the majority class; as a result, AdaCost has performed empirically better in lowering cumulative misclassification costs than AdaBoost.
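As a concrete illustration of the expected-cost idea behind MetaCost, the following sketch (NumPy; the cost values and probabilities are illustrative assumptions, not from the dissertation) relabels instances with the class that minimizes expected cost under a cost matrix like Table 2.1:

    import numpy as np

    # cost[i][j] = cost of predicting class j when the actual class is i
    # (class 0 = majority, class 1 = minority); assumed values.
    cost = np.array([[0.0, 1.0],
                     [5.0, 0.0]])    # misclassifying the minority costs 5x more

    # probs[n][i] = estimated P(actual class i | x_n), e.g., from Bagging.
    probs = np.array([[0.9, 0.1],
                      [0.6, 0.4],
                      [0.2, 0.8]])

    expected_cost = probs @ cost          # expected cost of each prediction
    relabeled = expected_cost.argmin(axis=1)
    print(relabeled)                      # [0 1 1]: the 0.6/0.4 case flips to minority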

Some classifiers, such as the Naïve Bayes classifier or some Neural Networks, use a score to show the degree to which an instance belongs to a class. This type of ranking can be used in alternative classifiers by changing the threshold for an instance belonging to a class (Weiss, 2004). For biasing the discrimination procedure, Barandela et al. (2003) proposed a weighted distance function in classification, instead of altering the class distributions, in terms of a nearest neighbor (NN) classifier. Supposing that $d_e(\cdot)$ is the Euclidean metric, $x_{new}$ a new instance to classify, $x_0^i$ a training sample from class $i$, $n_i$ the number of instances of class $i$, and $m$ the dimensionality of the input variable, a weighted distance function $d_w(\cdot)$ is defined as:

$$d_w(x_{new}, x_0^i) = (n_i / n)^{1/m} \, d_e(x_{new}, x_0^i).$$

This assigns greater weighting factors to majority instances than to minority instances, consequently producing smaller distances to instances of the minority class than to those of the majority class. As a result, the neighbors of new instances are found among the minority instances, increasing the value of the geometric mean (g-mean). For the SVM classification algorithm, this biasing approach pushes the hyperplane further away from the minority (positive) class for imbalanced datasets. Wu and Chang (2003) proposed a biasing algorithm to change the kernel function. Biasing classification algorithms in SVM using larger penalty constants associated with the minority class makes misclassification errors for minority instances much costlier than errors for majority instances (Veropoulos et al., 1999). Huang et al. (2004) proposed a Biased Minimax Probability Machine (BMPM) to resolve learning for imbalanced datasets. Given the mean and covariance matrices of the majority and minority classes, BMPM formulates an optimization problem to find the decision hyperplane by adjusting the lower bound of the accuracy for the classification of future data. For example, if the objective function is to maximize the accuracy of classification for the minority class, the

optimization tries to maximize it by setting the lower bound of the classification accuracy for both classes. Achieving the worst-case accuracy for the minority class can be avoided while maintaining an acceptable accuracy level for the majority class in imbalanced data learning. One-class learning is an alternative to discrimination, where the model is created based on the instances of the target class alone. The main idea is that the boundaries between two classes are estimated from data of one class (the target class), so that this approach is not sensitive to the class distribution in the training set. A boundary around the target class is defined in such a way that most of the target objects are included and, at the same time, the chance of accepting outlier objects is minimized. For instance, Kubat et al. (1998) introduced the SHRINK algorithm following this general principle and applied it to detecting rare oil spills from satellite radar images. The goal was to find the classification rule that best identifies the positive examples (oil spills) using a g-mean measure. Assuming that the negative (majority) instances outnumber positive (minority) instances, the algorithm labeled the mixed regions as positive (minority). This alters the learner's focus: search for the best positive region, one with the maximum ratio of positives to negatives. Raskutti and Kowalczyk (2004) studied one-class learning on highly imbalanced datasets using a SVM classifier. They showed that one-class learning is useful for extremely imbalanced datasets with a high dimensional noisy feature space.

Ensemble learning methods

Ensemble learning is motivated by the information loss that occurs in under-sampling. In ensemble learning, multiple classifiers are generated by training on subsets from the original dataset. In the end, the classifiers are combined in a learning process and the final classification is determined by a voting scheme. Boosting and Bagging (Bootstrap aggregating) are the most

successful approaches. Most boosting algorithms iteratively learn weak classifiers that have been produced by placing different weights on the training instances. In each iteration, boosting increases weights for incorrectly classified instances and decreases weights for correctly classified ones, placing more attention on the incorrectly classified instances in the next iteration. RareBoost scales false-positive instances in proportion to how well they are differentiated from true-positive instances and scales false-negative instances in proportion to how well they are distinguished from true-negative instances (Joshi et al., 2001). SMOTEBoost (Chawla et al., 2003) addressed the issue that boosting may produce an over-fitting problem as in over-sampling. Instead of updating weights to change distributions of the training dataset, it adds new instances of the minority class using SMOTE. Chan and Stolfo (2001) proposed another ensemble method conceptually similar to a Bagging approach. They conducted some preliminary experiments to identify a desired class distribution that avoids the class imbalance problem, and then resampled to make multiple training sets based on the desired class distribution. Each training set contained all instances of the minority class and a subset of the majority instances. To use all instances of the majority class, each majority class instance appeared in at least one training set. Finally, the learning algorithm was applied to each training set and a composite learner was created from the classification results of all classifiers. Recently, two algorithms, EasyEnsemble and BalanceCascade, have been introduced (Liu et al., 2006). The strategy of these two methods is to make several training sets by keeping all the minority instances and under-sampling several subsets from the majority class. With replacement in sampling of the majority class, they overcome the potential information loss of the majority class. EasyEnsemble independently samples (with replacement) from the majority class several subsets whose size is equal to the size of the minority class and generates the individual

classifiers for the subsets. In other words, EasyEnsemble generates T balanced training sets. The output of learning the i-th training set is an AdaBoost classifier $H_i$ (i = 1, ..., T). Then all generated classifiers $H_i$, i = 1, ..., T, are combined for the final decision. The BalanceCascade method reduces the size of the majority class iteratively, based on the most recent classifier. This algorithm uses a trained classifier to guide the sampling process for subsequent classifiers. Initially, it samples a balanced training set like EasyEnsemble. After the AdaBoost ensemble is trained with the initial balanced training set, all majority instances that have been correctly classified are removed from the majority class. In this manner, the majority training set is reduced after every AdaBoost ensemble $H_i$ is trained. This sampling strategy reduces the redundant information of the majority class and explores as much useful information as possible. Besides the methods already discussed, other approaches have been used to address the class imbalance problem. For example, feature selection was used to select important features for the minority and majority classes separately and then explicitly combine them (Zheng et al., 2004).
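A minimal EasyEnsemble-style sketch follows (Python with scikit-learn's AdaBoostClassifier; the sampling loop and the simple majority vote are our own simplifications of the method described above):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def easy_ensemble(X, y, T=10, rng=None):
        # Train T AdaBoost classifiers H_i, each on all minority instances
        # (label 1) plus an equal-size majority subset sampled with replacement.
        rng = np.random.default_rng(rng)
        min_idx = np.where(y == 1)[0]
        maj_idx = np.where(y == 0)[0]
        ensemble = []
        for _ in range(T):
            sub = rng.choice(maj_idx, size=len(min_idx), replace=True)
            idx = np.concatenate([min_idx, sub])
            ensemble.append(AdaBoostClassifier().fit(X[idx], y[idx]))
        return ensemble

    def ensemble_predict(ensemble, X):
        votes = np.mean([h.predict(X) for h in ensemble], axis=0)
        return (votes >= 0.5).astype(int)     # majority vote of H_1, ..., H_T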

2.2 Performance Measures for Imbalanced Data Learning

Performance measures are used to assess the effectiveness of learning methods. In general, accuracy (or error rate) is the most common metric for most classification tasks and is given by

$$Accuracy = \frac{TP + TN}{TP + FN + FP + TN}. \quad (2.1)$$

For a two-class classification problem, classification performance is evaluated by a confusion matrix (contingency table), as seen in Table 2.2. However, for a skewed class distribution, accuracy is not suitable for evaluating imbalanced data learning, because the overall accuracy may be dominated by the classification accuracy of the majority class. For this reason, other metrics have been used, namely, precision, recall, F-measure, the geometric mean (g-mean) of the accuracy on the majority class and the minority class, and the maximum sum (MS). These metrics are based on the confusion matrix (see Table 2.2). In this research, positive and negative correspond to the minority and majority class, respectively.

Table 2.2 Confusion matrix for performance evaluation

                     Predicted Positive     Predicted Negative
    Real Positive    TP (True Positive)     FN (False Negative)
    Real Negative    FP (False Positive)    TN (True Negative)

Intuitively, precision is a measure of how many of the instances labeled as positive are correctly labeled and is calculated as

$$precision = \frac{TP}{TP + FP}. \quad (2.2)$$

Recall is a measure of how many instances of the positive class were labeled correctly and is defined as

$$recall = \frac{TP}{TP + FN}. \quad (2.3)$$

Unlike accuracy and error rate, precision and recall are both less sensitive to changes in data distributions. As an assessment of the accuracy for the positive class, precision is somewhat sensitive to data distributions, while recall is not. Recall gives no insight into how many

instances are incorrectly classified as positive. Similarly, precision does not tell us how many positive instances are incorrectly classified. Nevertheless, when used properly, precision and recall can effectively evaluate classification performance in imbalanced learning scenarios. The F-measure metric combines precision and recall as a measure of the effectiveness of classification, in terms of a ratio of the weighted importance of either recall or precision as determined by the coefficient $\beta$ set by the user, and is given by

$$F\text{-}measure = \frac{(1 + \beta^2) \cdot recall \cdot precision}{\beta^2 \cdot recall + precision}, \quad (2.4)$$

where $\beta$ is a coefficient to adjust the relative importance of precision versus recall. As a result, the F-measure provides more insight into the functionality of a classifier than the accuracy metric. Another metric, the g-mean, evaluates the degree of inductive bias in terms of a ratio of positive accuracy and negative accuracy and is defined as

$$g\text{-}mean = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}. \quad (2.5)$$

The maximum sum (MS) is used as an evaluation metric that gives equal weight to the classification accuracy of the positive and the negative class and is given by

$$MS = \frac{TP}{TP + FN} + \frac{TN}{TN + FP}. \quad (2.6)$$

Receiver Operating Characteristic (ROC) analysis from signal detection theory is also used as a metric for imbalanced data learning. The area under the ROC curve (AUC) assesses overall classification performance (Bradley, 1997). AUC does not place more emphasis on one class over the other, so it is not biased against the minority class.
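Tying equations (2.1)-(2.6) together, the following sketch (NumPy; function and variable names are ours) computes all of these metrics from true and predicted labels, with positive = minority = 1:

    import numpy as np

    def imbalance_metrics(y_true, y_pred, beta=1.0):
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        acc = (tp + tn) / (tp + fn + fp + tn)                    # (2.1)
        precision = tp / (tp + fp)                               # (2.2)
        recall = tp / (tp + fn)                                  # (2.3)
        f = ((1 + beta**2) * recall * precision
             / (beta**2 * recall + precision))                   # (2.4)
        g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))    # (2.5)
        ms = tp / (tp + fn) + tn / (tn + fp)                     # (2.6)
        return dict(accuracy=acc, precision=precision, recall=recall,
                    f_measure=f, g_mean=g_mean, max_sum=ms)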

In addition, Precision-Recall (PR) curves (Davis & Goadrich, 2006) and cost curves (Holte & Drummond, 2006) have been used to evaluate imbalanced dataset learning of classifiers and also to visualize performance.

2.3 Summary and Research Scope

Although previous methods have in some cases produced satisfactory results for imbalanced data learning, some of them may not be practical to implement or may conflict with a specific classification learning algorithm. Imbalanced data learning methods need to consider both performance and interaction with classification algorithms. In this research, we examine the class imbalance problem focusing on a specific classification algorithm, the SVM, and a comparison of methodologies from the perspectives of effectiveness (the ability to accurately classify an unknown dataset) and efficiency (the speed of classifying data). Although a SVM is more accurate on moderately imbalanced data compared with other standard classifiers, an SVM is also generally prone to generate a classifier that is extremely biased toward the majority class. To cope with imbalanced dataset learning with a SVM, the previously discussed sampling strategies could be used, but they can introduce significant degradation in learning efficiency. For example, while over-sampling keeps all existing information in a learning dataset and solves the class imbalance problem by adding information on the minority class, processing of the training datasets could be costly if many instances are over-sampled to handle imbalanced datasets. Recent work on imbalanced learning with a SVM has focused on improving classification performance (Akbani et al., 2004, Raskutti & Kowalczyk, 2004). However, the

efficiency of imbalanced data learning was not fully considered. In this research, we present a sampling methodology for the problem of class imbalance considering both effectiveness and efficiency in learning with a SVM. The base assumption is that the classification accuracies of both classes are equally important.

CHAPTER 3 CLASS IMBALANCE PROBLEM WITH SUPPORT VECTOR MACHINE LEARNING

3.1 Support Vector Machine (SVM) Classifier

A SVM uses a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. This learning strategy was introduced by Vapnik (1995) and has been widely used in the machine learning community due to its theoretical foundations and practical performance, in applications ranging from image retrieval (Tong & Chang, 2001) and handwriting recognition (Cortes, 1995) to text classification (Joachims, 1998). In classification tasks, a SVM tries to find an efficient way of learning good separating hyperplanes in a high dimensional feature space that maximize the margin between the two classes. The simplest model of a SVM starts with the maximal margin classifier. It works only for linearly separable cases in feature space, so it assumes that there is no training error. Generally, it may not be applicable to the separation of many real datasets: if the data are noisy, no separation exists in feature space. Nonetheless, the maximal margin classifier exhibits the key characteristics of this kind of learning machine. First, it can be viewed as a convex optimization problem: minimizing a quadratic function under linear inequality constraints. Suppose that we have a training dataset $\{(x_i, y_i)\}$ for $i = 1, \dots, l$, where $x_i$ is a vector in the input space $S \subseteq \mathbb{R}^N$ and $y_i$ denotes the class label, taking either +1 or -1.

[Figure: the separating hyperplane $\{x : w \cdot x + b = 0\}$, its perpendicular distance $|b|/\|w\|$ from the origin, and the margin hyperplanes $H_1 : \{x : w \cdot x + b = +1\}$ and $H_2 : \{x : w \cdot x + b = -1\}$.]

Figure 3.1 Linear separating hyperplanes for the separable case

As shown in Figure 3.1, the points x which lie on the hyperplane satisfy $w \cdot x + b = 0$, where $w$ is normal to the hyperplane, $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of $w$. For the linearly separable case, the support vector algorithm looks for the separating hyperplane with the largest margin. Suppose that all the training data meet the following constraints:

$$w \cdot x_i + b \geq +1 \quad \text{for } y_i = +1, \quad (3.1)$$
$$w \cdot x_i + b \leq -1 \quad \text{for } y_i = -1. \quad (3.2)$$

The margin is defined as the distance between the two hyperplanes $H_1 : w \cdot x + b = 1$ and $H_2 : w \cdot x + b = -1$, which share the normal $w$. Since these two hyperplanes are parallel and have the same normal, the margin distance that separates them is given by

$$\frac{w}{\|w\|} \cdot (x_1 - x_2) = \frac{w \cdot x_1 - w \cdot x_2}{\|w\|} = \frac{2}{\|w\|}, \quad (3.3)$$

where $x_1$ lies on $H_1$ and $x_2$ lies on $H_2$. Each instance that falls on one of the two hyperplanes is called a support vector (SV). The SVs in Figure 3.1 are circled. Given this geometric relationship, finding the hyperplanes with the maximum margin in feature space can be formulated as a mathematical programming problem. For linearly separable training data $S = ((x_1, y_1), \dots, (x_l, y_l))$, the hyperplane producing the maximum margin is found by formulating the minimization problem as follows:

$$\text{minimize } \|w\|^2 / 2 \quad \text{subject to } y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \dots, l. \quad (3.4)$$

We now switch to a Lagrangian formulation of the problem with the Lagrange multipliers $\alpha_i \geq 0$. By doing this, the constraints in Equation (3.4) are replaced by constraints on the Lagrange multipliers themselves, which are easier to handle. The primal Lagrangian is given by

$$L_P(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i [y_i (w \cdot x_i + b) - 1], \quad \alpha_i \geq 0. \quad (3.5)$$

Then we must minimize $L_P(w, b, \alpha)$ with respect to $w$ and $b$, subject to $\alpha_i \geq 0$. This is a convex quadratic programming problem, since the objective function is itself convex, and those

points which satisfy the constraints also form a convex set. This indicates that this problem can be equally solved with its corresponding dual problem, subject to the derivatives of $L_P(w, b, \alpha)$ with respect to $w$ and $b$:

$$\frac{\partial L_P(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i, \quad (3.6)$$
$$\frac{\partial L_P(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0, \quad (3.7)$$

also subject to the constraints $\alpha_i \geq 0$. Since these are equality constraints in the dual formulation, (3.6) can be substituted into (3.5), which results in

$$L_D = \frac{1}{2} w \cdot w - \sum_{i=1}^{l} \alpha_i [y_i (w \cdot x_i + b) - 1] = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j. \quad (3.8)$$

The primal ($L_P$) and dual ($L_D$) come from the same objective function but with different constraints, and the solution is found by minimizing $L_P$ or maximizing $L_D$. Given that we want to maximize $L_D$ with respect to $\alpha$, the optimization problem can be formulated as

$$\text{maximize } L_D(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to } \alpha_i \geq 0, \; \sum_{i=1}^{l} \alpha_i y_i = 0, \quad i = 1, \dots, l. \quad (3.9)$$

In solving this problem, the positive values $\alpha_i^*$ give $w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i$, which generates the maximal margin hyperplane with margin $1/\|w^*\|$. Those points whose $\alpha_i^*$ is positive are the SVs (and are located on $H_1$ and $H_2$ in Figure 3.1), while for the other points $\alpha_i^*$ is zero. Since the value of $b$ does not appear in the dual problem, $b^*$ is found by making use of the primal constraints. The optimal solutions $\alpha^*$, $(w^*, b^*)$ must satisfy

$$\alpha_i^* [y_i (w^* \cdot x_i + b^*) - 1] = 0, \quad i = 1, \dots, l.$$

This is the Karush-Kuhn-Tucker (KKT) complementary optimality condition. With this condition, $b$ can be computed. The hyperplane decision function can then be written as

$$f(x) = \text{sgn}\left( \sum_{i=1}^{l} \alpha_i^* y_i \, x_i \cdot x + b^* \right). \quad (3.10)$$

This implies that the support vectors are the critical points in the training set and lie closest to the hyperplane producing the maximum margin between the two class labels. So far we have considered only the separable case of training data. How can we extend these strategies to deal with a non-separable case? This is done by introducing positive slack variables $\xi_i$, $i = 1, \dots, l$. The constraints become:

$$w \cdot x_i + b \geq 1 - \xi_i \quad \text{for } y_i = +1, \quad (3.11)$$
$$w \cdot x_i + b \leq -1 + \xi_i \quad \text{for } y_i = -1. \quad (3.12)$$

[Figure: the non-separable case, with slack variables $\xi_i$ allowing instances to fall inside the margin or on the wrong side of the margin hyperplanes $H_1 : \{x : w \cdot x + b = +1\}$ and $H_2 : \{x : w \cdot x + b = -1\}$.]

Figure 3.2 Linear separating hyperplanes for the non-separable case

So when an error occurs, the corresponding $\xi_i$ must exceed unity, so $\sum_i \xi_i$ is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from $\|w\|^2 / 2$ (see Equation 3.4) to $\|w\|^2 / 2 + C \sum_{i=1}^{l} \xi_i$, where $C$ is a penalty parameter chosen by the user. This is the concept behind soft-margin SVMs. Introducing $\mu_i$ as the Lagrange multipliers of the $\xi_i$, the Lagrange (primal) function is:

$$L_P = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i [y_i (w \cdot x_i + b) - 1 + \xi_i] - \sum_{i=1}^{l} \mu_i \xi_i. \quad (3.13)$$

Minimizing the Lagrange (primal) problem with respect to $w$, $b$, and $\xi_i$, setting the respective derivatives to zero, we get Equations (3.6), (3.7) above, and

$$\alpha_i = C - \mu_i, \quad \forall i. \quad (3.14)$$

By substituting the derivatives into the primal problem, the corresponding dual problem is formulated as

$$\text{maximize } L_D(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to } \sum_{i=1}^{l} \alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C, \quad i = 1, \dots, l. \quad (3.15)$$

With the derivatives and the KKT conditions, Equations (3.16)-(3.18), we obtain

$$\alpha_i \{ y_i (x_i \cdot w + b) - 1 + \xi_i \} = 0, \quad (3.16)$$
$$\mu_i \xi_i = 0, \quad (3.17)$$
$$y_i (x_i \cdot w + b) - (1 - \xi_i) \geq 0. \quad (3.18)$$

The solution is given by $w = \sum_{i=1}^{N_s} \hat{\alpha}_i y_i x_i$, where $N_s$ is the number of SVs, which have non-zero coefficients $\hat{\alpha}_i$. Among the SVs, some lie on the edge of the margin ($\hat{\xi}_i = 0$) and are characterized by $0 < \hat{\alpha}_i < C$ from Equations (3.17) and (3.14), while the remaining ones ($\hat{\xi}_i > 0$) have $\hat{\alpha}_i = C$. Therefore, points on the wrong side of the boundary are SVs, and points on the margin edge on the correct side of the boundary are also SVs. A SVM is a linear classifier, but in most cases this is practically restrictive. A SVM can be easily extended to a nonlinear classifier by mapping the input space into a high dimensional feature space through a kernel function $K(x_i, x_j)$, which computes the dot product of the data points in the feature space H, that is,

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j). \quad (3.19)$$

Functions that satisfy Mercer's theorem (Burges, 1998) can be used as dot products and thus can be used as kernels. Common kernel functions include the linear kernel $K(x_i, x_j) = x_i \cdot x_j$, the polynomial kernel $K(x_i, x_j) = (x_i \cdot x_j + 1)^d$, and the Gaussian radial-based kernel $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$. Thus the nonlinear separating hyperplane can be found by formulating the optimization problem

$$\text{maximize } L_D(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to } \sum_{i=1}^{l} \alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C, \quad i = 1, \dots, l. \quad (3.20)$$

The corresponding decision function is

$$f(x) = \text{sgn}(w \cdot \Phi(x) + b) = \text{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right). \quad (3.21)$$
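Equations (3.19)-(3.21) can be checked numerically. The sketch below (NumPy and scikit-learn; here gamma stands for $1/(2\sigma^2)$, and the toy data are our own) builds a Gaussian kernel matrix and verifies that a fitted SVM's decision values equal $\sum_i \alpha_i y_i K(x_i, x) + b$, using the stored dual coefficients $\alpha_i y_i$:

    import numpy as np
    from sklearn.svm import SVC

    def rbf_kernel(A, B, gamma=0.5):
        # Gaussian radial-based kernel K(a, b) = exp(-gamma * ||a - b||^2).
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

    clf = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)

    # Decision function rebuilt from the dual solution (equation 3.21);
    # clf.dual_coef_ holds alpha_i * y_i for each support vector x_i.
    K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)
    f = clf.dual_coef_ @ K + clf.intercept_
    assert np.allclose(f.ravel(), clf.decision_function(X))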

3.2 SVMs and the Skewed Boundary

As noted previously, imbalanced data sets cause a bias in the results of a SVM. Akbani et al. (2004) have summarized three reasons why a skewed boundary occurs in SVM classification for imbalanced data sets: (1) positive (minority) instances lie further from the ideal boundary compared with negative (majority) instances, (2) the weakness of the soft-margin SVMs, and (3) the imbalanced SV ratio. For the third reason, according to the KKT conditions in solving the optimization problem in SVM, the $\alpha_i$ values must satisfy $\sum_{i=1}^{n} \alpha_i y_i = 0$. Since the $\alpha_i$ values for the minority class tend to be much larger than those for the majority class, and the number of minority SVs is substantially smaller, the nearest neighborhood of a test point is likely to be dominated by majority SVs. That means that the decision function is more likely to classify a boundary point as majority. The second reason, the weakness of the soft-margin SVMs, is an inherent weakness in coping with imbalanced data learning. For separable cases, the imbalance of the class distribution rarely influences the performance of SVMs, because all the slack variables are equal to zero (Equations 3.11 and 3.12). Therefore, there is no contradiction between the capacity of the SVMs and the classification error. However, for non-separable cases, soft-margin SVMs should achieve a trade-off between maximizing the margin between two classes and minimizing the classification error. Typically, many more majority instances appear in the overlapping area than minority ones. So, the optimal hyperplane will be skewed to the minority class side in order to reduce the overwhelming errors of misclassifying the majority class. If C is not very large, SVMs simply predict most minority instances as majority instances to make the margin as large as possible, making the total misclassification cost as small as possible. Several methods for SVMs for imbalanced data learning have been studied (Karakoulas & Taylor, 1999; Lin et al., 2002). At the data level, rebalancing approaches such as over-sampling (i.e., SMOTE) and under-sampling have been widely used for SVM to cope with imbalanced datasets. Veropoulos et al. (1999) suggested using different penalties for misclassification of the classes. Amari and Wu (1999) proposed using a conformal transformation of the kernel matrix to enlarge the separation between two classes. In the first step, it finds the separating location between two classes through standard SVM learning. In the second step, the primary kernel matrix is conformally scaled to give a wider separation. Separation is controlled by the SVs, so the new kernel matrix is enlarged at the positions of the SVs.
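A minimal sketch of the different-penalties idea of Veropoulos et al. (1999), using scikit-learn's class_weight to give the minority class a larger penalty constant (the 10x weight and the generated data are illustrative assumptions, not the dissertation's setup):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=440, n_features=2, n_informative=2,
                               n_redundant=0, weights=[0.9, 0.1], random_state=0)

    # C+ for the minority class (label 1) is effectively 10x the majority C-,
    # so minority misclassifications are costlier in the soft-margin objective.
    clf = SVC(kernel='rbf', C=1.0, class_weight={0: 1.0, 1: 10.0}).fit(X, y)
    print((clf.predict(X[y == 1]) == 1).mean())   # minority training accuracy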

Another approach is the one-class SVM (Schölkopf and Smola, 2002). This uses only minority instances for learning. Its original application is detecting outliers that differ from most of the data within a dataset. It determines a hyperplane in feature space that separates most of the data from the origin. It completely ignores information from the majority class: instead, using only one class, the minority class, it defines a hyperplane that separates most of the data belonging to that class from the origin.

3.3 Problems associated with SVM classifier for imbalanced data

When focusing on approaches at the data level (rebalancing the data distribution), there are two significant problems associated with a SVM classifier, namely:

1. Over-sampling methods significantly increase the dataset size.
2. An optimal ratio of class distribution is empirically determined by grid search.

To address these problems, we propose a new sampling method at the data level for imbalanced data learning. Instead of rebalancing the entire imbalanced dataset, a selective sampling method is proposed that results in a relatively small number of instances. We expect that a small set of representative instances of an imbalanced dataset could determine the desired decision boundary, maintaining the same or achieving even better performance for the class imbalance problem as compared to existing rebalancing methods. The merits of this approach include: (1) skipping the empirical search that is necessary in sampling methods, such as for optimal ratios of class distributions, and (2) avoiding producing a large training set that would lead to long training times for a SVM. If this approach performs well as compared to some current methods, it will provide an alternative method to solve the class imbalance problem with the advantage of reducing learning time for a SVM.

3.4 Effectiveness of rebalancing class distribution

For imbalanced and highly overlapped class data, sampling methods such as over-sampling or under-sampling are very effective in terms of the optimization process in a soft-margin SVM. In order to illustrate the effect of sampling methods in rebalancing the class distribution, we examine the boundary movement under two common methods, SMOTE over-sampling and random under-sampling. In this example, a Gaussian kernel function is used for SVM classification. First, we generated a synthetic dataset with a simple structure showing a typical class imbalance problem, which can be represented in 2-dimensional space as shown in Figure 3.3. The minority class, having 40 instances, is marked with 'o' and the majority class, having 400 instances, with '.' (a class ratio of 1:10).

Figure 3.3 Example of class imbalance problem on SVMs
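A sketch reproducing this kind of synthetic imbalanced set (40 minority vs. 400 majority instances, a 1:10 ratio) and the baseline Gaussian-kernel SVM follows; the cluster locations and spreads are our own assumptions, not the dissertation's actual generator:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X_maj = rng.normal(0.0, 1.0, size=(400, 2))   # majority class ('.')
    X_min = rng.normal(1.2, 0.8, size=(40, 2))    # minority class ('o'), overlapping
    X = np.vstack([X_maj, X_min])
    y = np.array([0] * 400 + [1] * 40)            # class ratio 1:10

    clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)
    print("majority accuracy:", (clf.predict(X_maj) == 0).mean())
    print("minority accuracy:", (clf.predict(X_min) == 1).mean())
    # The majority accuracy is typically near 1 while the minority accuracy
    # is poor, reproducing the skewed boundary described in the text.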

After classification, almost all majority instances were correctly classified, while many minority instances were classified as the majority class. This example illustrates a typical class imbalance problem caused by soft-margin SVM algorithms. In other words, an optimal hyperplane results from the trade-off between maximizing the margin between the minority and the majority class and minimizing misclassification costs in the feature space. To improve the accuracy for the minority class, we need to move the boundary toward the majority class side. To illustrate this, we applied two rebalancing sampling methods, SMOTE and random under-sampling, which will be referred to as SVM-SMOTE and SVM-RU, respectively. Using SVM-SMOTE, the number of synthetic instances needed to achieve the desired class balance is unknown, and empirical studies must be performed. Minority instances are over-sampled gradually, with 100%, 300%, 500% and 1000% increases in minority instances. After rebalancing by SMOTE, we observed that the boundary gradually shifted toward the majority class as the minority instances were increased, as shown in Figure 3.4 (a) to (f).

[Figure: six panels showing the decision boundary at (a) 100%, (b) 200%, (c) 500%, (d) 1000%, (e) 1700% and (f) 2000% increments; circle (o): minority instances, dot (.): majority instances, cross (+): synthetic instances by SMOTE.]

Figure 3.4 Boundary movements by SMOTE algorithm

Though SVM-SMOTE shifts the decision boundary, it comes at the penalty of increasing the size of the dataset, as mentioned in the previous section. Assume that $Np$ is the number of positive (minority) instances and $Nn$ the number of negative (majority) instances; typically, a SVM takes $O((Np + Nn)^3)$ time for learning in the worst case (Burges, 1998). For imbalanced data learning, SVM-SMOTE will take $O((Np(1 + R_{smote}) + Nn)^3)$, where $R_{smote}$ is the optimal over-sampling ratio.
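The boundary-shifting experiment can be sketched as follows, assuming the third-party imbalanced-learn package for SMOTE (the dataset and over-sampling rates are illustrative, not the dissertation's actual code):

    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (400, 2)),    # majority
                   rng.normal(1.2, 0.8, (40, 2))])    # minority
    y = np.array([0] * 400 + [1] * 40)

    for pct in [100, 300, 500, 1000]:                 # % increase in minority size
        n_min = 40 + 40 * pct // 100
        sm = SMOTE(sampling_strategy={1: n_min}, k_neighbors=5, random_state=0)
        Xr, yr = sm.fit_resample(X, y)
        clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(Xr, yr)
        print(pct, (clf.predict(X[y == 1]) == 1).mean())
    # As the minority class grows, the boundary moves toward the majority
    # class and the minority recall on the original data rises.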


More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2860-2866 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A selectve ensemble classfcaton method on mcroarray

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation Intellgent Informaton Management, 013, 5, 191-195 Publshed Onlne November 013 (http://www.scrp.org/journal/m) http://dx.do.org/10.36/m.013.5601 Qualty Improvement Algorthm for Tetrahedral Mesh Based on

More information

Face Recognition Method Based on Within-class Clustering SVM

Face Recognition Method Based on Within-class Clustering SVM Face Recognton Method Based on Wthn-class Clusterng SVM Yan Wu, Xao Yao and Yng Xa Department of Computer Scence and Engneerng Tong Unversty Shangha, Chna Abstract - A face recognton method based on Wthn-class

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Taxonomy of Large Margin Principle Algorithms for Ordinal Regression Problems

Taxonomy of Large Margin Principle Algorithms for Ordinal Regression Problems Taxonomy of Large Margn Prncple Algorthms for Ordnal Regresson Problems Amnon Shashua Computer Scence Department Stanford Unversty Stanford, CA 94305 emal: shashua@cs.stanford.edu Anat Levn School of Computer

More information

SUMMARY... I TABLE OF CONTENTS...II INTRODUCTION...

SUMMARY... I TABLE OF CONTENTS...II INTRODUCTION... Summary A follow-the-leader robot system s mplemented usng Dscrete-Event Supervsory Control methods. The system conssts of three robots, a leader and two followers. The dea s to get the two followers to

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Network Intrusion Detection Based on PSO-SVM

Network Intrusion Detection Based on PSO-SVM TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

Multi-stable Perception. Necker Cube

Multi-stable Perception. Necker Cube Mult-stable Percepton Necker Cube Spnnng dancer lluson, Nobuuk Kaahara Fttng and Algnment Computer Vson Szelsk 6.1 James Has Acknowledgment: Man sldes from Derek Hoem, Lana Lazebnk, and Grauman&Lebe 2008

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

A Robust LS-SVM Regression

A Robust LS-SVM Regression PROCEEDIGS OF WORLD ACADEMY OF SCIECE, EGIEERIG AD ECHOLOGY VOLUME 7 AUGUS 5 ISS 37- A Robust LS-SVM Regresson József Valyon, and Gábor Horváth Abstract In comparson to the orgnal SVM, whch nvolves a quadratc

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

An Anti-Noise Text Categorization Method based on Support Vector Machines *

An Anti-Noise Text Categorization Method based on Support Vector Machines * An Ant-Nose Text ategorzaton Method based on Support Vector Machnes * hen Ln, Huang Je and Gong Zheng-Hu School of omputer Scence, Natonal Unversty of Defense Technology, hangsha, 410073, hna chenln@nudt.edu.cn,

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

Using Neural Networks and Support Vector Machines in Data Mining

Using Neural Networks and Support Vector Machines in Data Mining Usng eural etworks and Support Vector Machnes n Data Mnng RICHARD A. WASIOWSKI Computer Scence Department Calforna State Unversty Domnguez Hlls Carson, CA 90747 USA Abstract: - Multvarate data analyss

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

LECTURE NOTES Duality Theory, Sensitivity Analysis, and Parametric Programming

LECTURE NOTES Duality Theory, Sensitivity Analysis, and Parametric Programming CEE 60 Davd Rosenberg p. LECTURE NOTES Dualty Theory, Senstvty Analyss, and Parametrc Programmng Learnng Objectves. Revew the prmal LP model formulaton 2. Formulate the Dual Problem of an LP problem (TUES)

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

Correlative features for the classification of textural images

Correlative features for the classification of textural images Correlatve features for the classfcaton of textural mages M A Turkova 1 and A V Gadel 1, 1 Samara Natonal Research Unversty, Moskovskoe Shosse 34, Samara, Russa, 443086 Image Processng Systems Insttute

More information

Efficient Text Classification by Weighted Proximal SVM *

Efficient Text Classification by Weighted Proximal SVM * Effcent ext Classfcaton by Weghted Proxmal SVM * Dong Zhuang 1, Benyu Zhang, Qang Yang 3, Jun Yan 4, Zheng Chen, Yng Chen 1 1 Computer Scence and Engneerng, Bejng Insttute of echnology, Bejng 100081, Chna

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

Discriminative classifiers for object classification. Last time

Discriminative classifiers for object classification. Last time Dscrmnatve classfers for object classfcaton Thursday, Nov 12 Krsten Grauman UT Austn Last tme Supervsed classfcaton Loss and rsk, kbayes rule Skn color detecton example Sldng ndo detecton Classfers, boostng

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM Classfcaton of Face Images Based on Gender usng Dmensonalty Reducton Technques and SVM Fahm Mannan 260 266 294 School of Computer Scence McGll Unversty Abstract Ths report presents gender classfcaton based

More information

An Improved Neural Network Algorithm for Classifying the Transmission Line Faults

An Improved Neural Network Algorithm for Classifying the Transmission Line Faults 1 An Improved Neural Network Algorthm for Classfyng the Transmsson Lne Faults S. Vaslc, Student Member, IEEE, M. Kezunovc, Fellow, IEEE Abstract--Ths study ntroduces a new concept of artfcal ntellgence

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

INF 4300 Support Vector Machine Classifiers (SVM) Anne Solberg

INF 4300 Support Vector Machine Classifiers (SVM) Anne Solberg INF 43 Support Vector Machne Classfers (SVM) Anne Solberg (anne@f.uo.no) 9..7 Lnear classfers th mamum margn for toclass problems The kernel trck from lnear to a hghdmensonal generalzaton Generaton from

More information