Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset

Show-Jane Yen and Yue-Shi Lee
Department of Computer Science and Information Engineering, Ming Chuan University
5 The-Ming Rd., Gwei Shan District, Taoyuan County 333, Taiwan
{sjyen, leeys}@mcu.edu.tw

Abstract. The most important factor for improving classification accuracy is the training data. However, the data in real-world applications often have an imbalanced class distribution; that is, most of the data belong to a majority class and only a few belong to a minority class. In this case, if all the data are used as training data, the classifier tends to predict that most incoming data belong to the majority class. Hence, it is important to select suitable training data for classification under the imbalanced class distribution problem. In this paper, we propose cluster-based under-sampling approaches for selecting representative data as training data, in order to improve the classification accuracy for the minority class. The experimental results show that our cluster-based under-sampling approaches outperform the under-sampling techniques of previous studies.

1 Introduction

Classification analysis [5, 7] is a well-studied technique in the data mining and machine learning domains. Due to its forecasting capability, classification has been used in many real applications, such as detecting churning customers and credit card fraud in financial corporations. Classification analysis produces a class-predicting system (called a classifier) by analyzing the properties of a dataset whose classes are known, and the classifier can then forecast the classes of new samples with unknown class labels. For example, a medical officer can use a medical predicting system to predict whether a patient has a drug allergy. A dataset with given class labels can be used as a training dataset, and a classifier must be trained on a training dataset to acquire the capability for class prediction. In brief, the process of classification analysis consists of the following steps:

1. Collect samples.
2. Select samples and attributes for training.
3. Train a class-predicting system using the training samples.
4. Use the predicting system to forecast the class of incoming samples.

Classification techniques usually assume that the training samples are uniformly distributed among the different classes. A classifier performs well when it is applied to a dataset evenly distributed among the different classes.

However, many datasets in real applications involve the imbalanced class distribution problem [9, 11]. The imbalanced class distribution problem occurs when there are many more samples in one class than in the other class of a training dataset: the majority class holds a large percentage of all the samples, while the samples of the minority class occupy only a small part. In this case, a classifier usually tends to predict that samples belong to the majority class, completely ignoring the minority class. Many applications, such as fraud detection, intrusion prevention, risk management, and medical research, have the imbalanced class distribution problem. For example, suppose a bank would like to construct a classifier to predict whether customers will take fiduciary loans in the future, and the number of customers who have had fiduciary loans is only two percent of all customers. If the fiduciary loan classifier predicts that no customer will ever take a fiduciary loan, it will reach a rather high accuracy of 98 percent, yet it cannot find the target customers who will take fiduciary loans. Therefore, a classifier that can make correct predictions on the minority class is useful for helping corporations make proper policies and save a lot of cost.

In this paper, we study the effects of under-sampling [1, 6, 10] on the backpropagation neural network technique and propose new under-sampling approaches based on clustering, such that the influence of the imbalanced class distribution can be decreased and the accuracy of predicting the minority class can be increased.

2 Related Work

Since many real applications have the imbalanced class distribution problem, researchers have proposed several methods to solve it. Re-sampling approaches can be distinguished into over-sampling [4, 9] and under-sampling [10, 11]. The over-sampling approach increases the number of minority class samples to reduce the degree of imbalance. One famous over-sampling approach is SMOTE [2]. SMOTE produces synthetic minority class samples by selecting some of the nearest minority neighbors of a minority sample S and generating new minority class samples along the lines between S and each selected neighbor. SMOTE beats the random over-sampling approaches through its informed properties and reduces the imbalanced class distribution without causing overfitting. However, SMOTE generates synthetic minority class samples blindly, without considering the majority class samples, and may cause overgeneralization.
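The interpolation step that SMOTE performs can be made concrete with a short sketch. This is our illustration rather than the authors' code; the neighbor count k = 5 and the use of NumPy and scikit-learn's NearestNeighbors are assumptions made for readability.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples along segments between a minority
    sample S and one of its k nearest minority-class neighbors (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)        # idx[:, 0] is each sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))     # pick a minority sample S
        j = rng.choice(idx[i, 1:])            # pick one of S's minority neighbors
        gap = rng.random()                    # random position on the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.asarray(synthetic)
```

For instance, smote_like(X_min, n_new=200) would grow a 100-sample minority class to 300 samples, each new point lying between two existing minority samples, which is exactly why the method never looks at where the majority class sits.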

On the other hand, since one class has many more samples than the other under the imbalanced class distribution problem, the under-sampling approach reduces the number of samples in the majority class. Assume that, in a training dataset, MA is the set of samples belonging to the majority class and MI is the set of samples belonging to the minority class. An under-sampling approach decreases the skewed distribution of MA and MI by lowering the size of MA. Generally, the performance of over-sampling approaches is worse than that of under-sampling approaches.

One simple under-sampling method is to select a subset of MA randomly and combine it with MI as the training set, which is called the random under-sampling approach. Several more advanced methods have been proposed to make the selected samples more representative. The under-sampling approach based on distance [11] uses four distinct modes, the nearest, the farthest, the average nearest, and the average farthest distances between MI and MA, as standards for selecting representative samples from MA. For every minority class sample in the dataset, the first method, "nearest", calculates the distances to all majority class samples and selects the k majority class samples with the smallest distances to it. If there are n minority class samples in the dataset, the "nearest" approach finally selects k × n majority class samples (k ≥ 1); however, some of the selected majority class samples may be duplicates. Similar to the "nearest" approach, the "farthest" approach selects the majority class samples with the farthest distances to each minority class sample. For every majority class sample in the dataset, the third method, "average nearest", calculates the average distance between that majority class sample and all minority class samples, and selects the majority class samples with the smallest average distances. The last method, "average farthest", is similar to "average nearest": it selects the majority class samples with the farthest average distances to all the minority class samples. These distance-based under-sampling approaches [11] spend a lot of time selecting the majority class samples from a large dataset and are therefore not efficient in real applications.

In 2003, J. Zhang and I. Mani [10] compared four informed under-sampling approaches with the random under-sampling approach. The first method, NearMiss-1, selects majority class samples that are close to some minority class samples: a majority class sample is selected when its average distance to its three closest minority class samples is the smallest. The second method, NearMiss-2, selects the majority class samples whose average distances to their three farthest minority class samples are the smallest. The third method, NearMiss-3, takes out a given number of the closest majority class samples for each minority class sample. Finally, the fourth method, "most distant", selects the majority class samples whose average distances to their three closest minority class samples are the largest. The experimental results in [10] showed that the NearMiss-2 approach and the random under-sampling approach perform best.
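Because NearMiss-2 also serves as a baseline in Section 4, a minimal sketch of its selection rule may help. The function name and the use of scikit-learn's pairwise_distances are ours; the rule itself (smallest average distance to the three farthest minority samples) is as described above.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def nearmiss2(X_maj, X_min, n_select):
    """Keep the n_select majority samples whose average distance to their
    three farthest minority samples is smallest (the NearMiss-2 rule)."""
    d = pairwise_distances(X_maj, X_min)      # shape (|MA|, |MI|); assumes |MI| >= 3
    farthest3 = np.sort(d, axis=1)[:, -3:]    # three largest distances per majority sample
    score = farthest3.mean(axis=1)
    keep = np.argsort(score)[:n_select]       # smallest averages first
    return X_maj[keep]
```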

3 Our Approaches

In this section, we present our approach SBC (under-Sampling Based on Clustering), which uses clustering techniques to solve the imbalanced class distribution problem. Our approach first clusters all the training samples into several clusters. The main idea is that a dataset contains different clusters, each of which seems to have distinct characteristics. If a cluster has more majority class samples and fewer minority class samples, it behaves like the majority class samples. On the contrary, if a cluster has more minority class samples and fewer majority class samples, it does not hold the characteristics of the majority class samples and behaves more like the minority class samples. Therefore, our approach SBC selects a suitable number of majority class samples from each cluster by considering the ratio of the number of majority class samples to the number of minority class samples in the cluster.

3.1 Under-Sampling Based on Clustering

Assume that the number of samples in the class-imbalanced dataset is N, comprising the majority class samples (MA) and the minority class samples (MI). The size of a dataset is the number of samples in it; the size of MA is denoted Size_MA, and Size_MI is the number of samples in MI. In a class-imbalanced dataset, Size_MA is far larger than Size_MI. For our under-sampling method SBC, we first cluster all samples in the dataset into K clusters. The number of majority class samples and the number of minority class samples in the i-th cluster (1 ≤ i ≤ K) are Size_MA^i and Size_MI^i, respectively, so the ratio of majority to minority samples in the i-th cluster is Size_MA^i / Size_MI^i. If the ratio of Size_MA to Size_MI in the training dataset is set to m:1, the number of selected majority class samples in the i-th cluster is given by expression (1):

    SSize_{MA}^{i} = (m \times Size_{MI}) \times \frac{Size_{MA}^{i} / Size_{MI}^{i}}{\sum_{i=1}^{K} Size_{MA}^{i} / Size_{MI}^{i}}    (1)

In expression (1), m × Size_MI is the total number of selected majority class samples that we want in the final training dataset, and the denominator sums the ratio of majority to minority samples over all clusters. Expression (1) thus selects more majority class samples from a cluster that behaves more like the majority class samples; in other words, SSize_MA^i is larger when the i-th cluster has more majority class samples and fewer minority class samples. After determining by expression (1) the number of majority class samples to select from the i-th cluster (1 ≤ i ≤ K), we randomly choose that many majority class samples from the i-th cluster. The total number of selected majority class samples after merging the selections from all the clusters is m × Size_MI. At last, we combine all the minority class samples with the selected majority class samples to construct a new training dataset.
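Expression (1) reduces to a few lines of code. The sketch below is our illustration of the quota computation, assuming NumPy; rounding to the nearest integer is our choice, since the paper does not state how fractional quotas are handled, and clusters with no minority samples would need special treatment that the formula leaves open.

```python
import numpy as np

def sbc_quotas(size_ma, size_mi, m):
    """Per-cluster numbers of majority samples to select, per expression (1).

    size_ma[i], size_mi[i]: majority / minority counts in cluster i;
    m: target ratio of majority to minority samples in the final training set.
    """
    size_ma = np.asarray(size_ma, dtype=float)
    size_mi = np.asarray(size_mi, dtype=float)
    ratios = size_ma / size_mi            # Size_MA^i / Size_MI^i (assumes Size_MI^i > 0)
    total = m * size_mi.sum()             # m * Size_MI majority samples overall
    return np.rint(total * ratios / ratios.sum()).astype(int)
```

On the worked example below, sbc_quotas([500, 300, 200], [10, 50, 40], m=1) returns [82, 10, 8], matching Table 3.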

Table 1 shows the steps of our under-sampling approach SBC.

Table 1. The steps of the under-sampling based on clustering approach SBC

Step 1. Determine the ratio of Size_MA to Size_MI in the training dataset.
Step 2. Cluster all the samples in the dataset into several clusters.
Step 3. Determine the number of majority class samples to select from each cluster by expression (1), and then randomly select that many majority class samples from each cluster.
Step 4. Combine the selected majority class samples with all the minority class samples to obtain the training dataset.

For example, assume that an imbalanced dataset has 1100 samples in total, where the size of MA is 1000 and the size of MI is 100, and that we cluster this dataset into three clusters. Table 2 shows the number of majority class samples Size_MA^i, the number of minority class samples Size_MI^i, and the ratio Size_MA^i / Size_MI^i for each cluster.

Table 2. Cluster descriptions

Cluster ID | Number of majority class samples | Number of minority class samples | Size_MA^i / Size_MI^i
1          | 500                              | 10                               | 500/10 = 50
2          | 300                              | 50                               | 300/50 = 6
3          | 200                              | 40                               | 200/40 = 5

Assume that the ratio of Size_MA to Size_MI in the training data is set to 1:1; in other words, the training dataset contains 100 selected majority class samples and all 100 minority class samples. The number of majority class samples to select from each cluster is calculated by expression (1) and shown in Table 3. We finally select the majority class samples randomly from each cluster and combine them with the minority class samples to form the new training dataset.

Table 3. The number of selected majority class samples in each cluster

Cluster ID | Number of selected majority class samples
1          | 1 × 100 × 50 / (50+6+5) ≈ 82
2          | 1 × 100 × 6 / (50+6+5) ≈ 10
3          | 1 × 100 × 5 / (50+6+5) ≈ 8
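Putting the steps of Table 1 together, the following sketch shows one possible end-to-end realization of SBC. The paper does not name the clustering technique, so k-means (via scikit-learn) is an assumption here, as is capping a cluster's quota at the number of majority samples it actually contains; sbc_quotas is the expression (1) sketch given earlier.

```python
import numpy as np
from sklearn.cluster import KMeans

def sbc(X, y, majority_label, K=3, m=1, seed=0):
    """Under-sampling based on clustering, following the steps of Table 1."""
    rng = np.random.default_rng(seed)
    # Step 2: cluster all samples, majority and minority together.
    clusters = KMeans(n_clusters=K, random_state=seed, n_init=10).fit_predict(X)
    maj = (y == majority_label)
    size_ma = np.array([np.sum(maj & (clusters == i)) for i in range(K)])
    size_mi = np.array([np.sum(~maj & (clusters == i)) for i in range(K)])
    # Step 3: per-cluster quotas via expression (1), then random selection.
    quotas = sbc_quotas(size_ma, size_mi, m)
    keep = list(np.flatnonzero(~maj))            # Step 4: keep every minority sample...
    for i in range(K):
        pool = np.flatnonzero(maj & (clusters == i))
        n = min(quotas[i], len(pool))            # cannot select more than the cluster holds
        keep.extend(rng.choice(pool, size=n, replace=False))
    keep = np.asarray(keep)
    return X[keep], y[keep]                      # ...plus the selected majority samples
```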

3.2 Under-Sampling Based on Clustering and Distances

In the SBC method, all the samples are clustered into several clusters, the number of majority class samples to select is determined by expression (1), and the majority class samples are then randomly selected from each cluster. In this section, we propose five further under-sampling methods based on the SBC approach. They differ from SBC in the way the majority class samples are selected from each cluster: in the five proposed methods, the majority class samples are selected according to the distances between the majority class samples and the minority class samples in each cluster. Hence, the distances between samples must be computed. For a continuous attribute, the values of all samples need to be normalized in order to avoid the effect of different scales across attributes. For example, suppose A is a continuous attribute. To normalize the values of attribute A for all the samples, we first find the maximum value Max_A and the minimum value Min_A of A over all samples. To map an attribute value a into the interval between 0 and 1, a is normalized to (a - Min_A) / (Max_A - Min_A). For a categorical or discrete attribute, the distance between two attribute values x_1 and x_2 is 1 when x_1 is not equal to x_2, and 0 when they are the same. Assume that there are N attributes in a dataset and that V_i^X represents the value of attribute A_i in sample X, for 1 ≤ i ≤ N. The Euclidean distance between two samples X and Y is shown in expression (2):

    distance(X, Y) = \sqrt{\sum_{i=1}^{N} (V_{i}^{X} - V_{i}^{Y})^{2}}    (2)

The five approaches proposed in this section likewise first cluster all samples into K clusters (K ≥ 1) and determine the number of majority class samples to select from each cluster by expression (1). For each cluster, the representative majority class samples are then selected in different ways. The first method, SBCNM-1 (Sampling Based on Clustering with NearMiss-1), selects the majority class samples whose average distances to the M nearest minority class samples (M ≥ 1) in the i-th cluster (1 ≤ i ≤ K) are the smallest. In the second method, SBCNM-2 (Sampling Based on Clustering with NearMiss-2), the majority class samples whose average distances to the M farthest minority class samples in the i-th cluster are the smallest are selected. The third method, SBCNM-3 (Sampling Based on Clustering with NearMiss-3), selects the majority class samples whose average distances to the closest minority class samples in the i-th cluster are the smallest. In the fourth method, SBCMD (Sampling Based on Clustering with Most Distant), the majority class samples whose average distances to the M closest minority class samples in the i-th cluster are the farthest are selected. For these four approaches, we follow [10] in selecting the representative samples in each cluster. The last proposed method, SBCMF (Sampling Based on Clustering with Most Far), selects the majority class samples whose average distances to all the minority class samples in the cluster are the farthest.
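The distance computations used by these five variants can be sketched as follows. Encoding categorical attributes as values that can be compared for equality, and mapping a constant attribute to zero, are our assumptions; the normalization and expression (2) are as defined above.

```python
import numpy as np

def normalize_continuous(col):
    """Map a continuous attribute into [0, 1]: (a - Min_A) / (Max_A - Min_A)."""
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo) if hi > lo else np.zeros_like(col, dtype=float)

def distance(x, y, is_categorical):
    """Expression (2): Euclidean distance over N attributes, where a categorical
    attribute contributes 1 if the two values differ and 0 if they match."""
    diff = np.where(is_categorical, (x != y).astype(float), x - y)
    return np.sqrt(np.sum(diff ** 2))
```

Continuous columns are assumed to have been passed through normalize_continuous before distance is called, so that no single attribute dominates through its scale.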

4 Experimental Results

In our experiments, we use three criteria to evaluate the classification accuracy for the minority class: the precision rate P, the recall rate R, and the F-measure for the minority class. Generally, for a classifier, if the precision rate is high, then the recall rate is low; the two criteria trade off against each other, so neither alone suffices to evaluate the performance of a classifier. Hence, the precision rate and recall rate are combined into another criterion, the F-measure, shown in expression (3):

    \text{MI's F-measure} = \frac{2 \times P \times R}{P + R}    (3)

In the following, we use the three criteria discussed above to evaluate the performance of our approaches SBC, SBCNM-1, SBCNM-2, SBCNM-3, SBCMD, and SBCMF by comparing them with three other methods: AT, RT, and NearMiss-2. The method AT uses all samples to train the classifier and does not select samples. RT is the most commonly used random under-sampling approach; it selects the majority class samples randomly. The last method, NearMiss-2, was proposed by J. Zhang and I. Mani [10] and is discussed in Section 2; RT and NearMiss-2 performed better than the other methods proposed in [10]. In the following experiments, the classifiers are constructed using the artificial neural network technique in IBM Intelligent Miner for Data V8.1.

Table 4. The experimental results on the Census-Income Database

Method     | MI's Precision | MI's Recall | MI's F-measure | MA's Precision | MA's Recall | MA's F-measure
SBC        | 47.78          | 88.88       | 62.15          | 94.84          | 67.79       | 79.06
RT         | 30.29          | 99.73       | 46.47          | 99.63          | 23.92       | 38.58
AT         | 35.1           | 98.7        | 51.9           | 98.9           | 39.5        | 43.8
NearMiss-2 | 46.3           | 81.23       | 58.98          | 91.70          | 68.77       | 78.60
SBCNM-1    | 29.28          | 99.80       | 45.28          | 99.67          | 20.07       | 33.41
SBCNM-2    | 29.6           | 99.67       | 45.64          | 99.49          | 21.39       | 35.21
SBCNM-3    | 28.72          | 99.8        | 44.61          | 99.63          | 17.9        | 30.35
SBCMD      | 29.01          | 99.73       | 44.94          | 99.54          | 19.05       | 31.99
SBCMF      | 43.15          | 93.48       | 59.04          | 96.47          | 59.15       | 73.34

We compare our approaches with the other under-sampling approaches on two real datasets. The first is the Census-Income Database from the UCI Knowledge Discovery in Databases Archive. It contains census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau. The binary classification problem in this dataset is to determine the income level of the person represented by each record. The total number of samples after cleaning the incomplete data is 30162, including 22654 majority class samples, whose income level is less than 50K dollars, and 7508 minority class samples, whose income level is greater than or equal to 50K dollars. We use eighty percent of the samples to train the classifiers and twenty percent to evaluate their performance. The precision rate, recall rate, and F-measure for our approaches and the other approaches are shown in Table 4.
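For completeness, the three evaluation criteria can be computed as below, with expression (3) in the last line; the function and label conventions are ours.

```python
def minority_metrics(y_true, y_pred, minority_label):
    """Precision P, recall R, and F-measure (expression (3)) for the minority class."""
    tp = sum(t == p == minority_label for t, p in zip(y_true, y_pred))
    fp = sum(t != minority_label and p == minority_label for t, p in zip(y_true, y_pred))
    fn = sum(t == minority_label and p != minority_label for t, p in zip(y_true, y_pred))
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F
```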

[Fig. 1. The execution time on the Census-Income Database for each method. Bar chart; y-axis: execution time (min.), 0 to 200; x-axis: SBC, RT, AT, NearMiss-2, SBCNM-1, SBCNM-2, SBCNM-3, SBCMD, SBCMF.]

Fig. 1 shows the execution time for each method, which includes selecting the training data and training the classifier. In Table 4, we can observe that our method SBC has the highest MI's F-measure and MA's F-measure compared with the other methods. Besides, SBC needs only a short execution time, as shown in Fig. 1.

The other real dataset in our experiments was provided by a bank and is called the Overdue Detection Database. The records in the Overdue Detection Database contain customer information, the statuses of customers' payments, the amounts of money in customers' bills, and so on. The purpose of this binary classification problem is to detect the bad customers, who are the minority among all customers and do not pay their bills before the deadline. We separate the Overdue Detection Database into two subsets: the data extracted from November 2004 are used for training the classifier, and the data extracted from December 2004 are used for testing. The training data contain 62309 samples in total, including 47707 majority class samples, which represent the good customers, and 14602 minority class samples, which represent the bad customers. The testing data contain 63532 samples in total, including 49931 majority class samples and 13601 minority class samples. Fig. 2 shows the precision rate, the recall rate, and the F-measure of the minority class for each approach; from Fig. 2, we can see that our approaches SBC and SBCMD have the best MI's F-measure. Fig. 3 shows the execution times of all the approaches on the Overdue Detection Database. In these two real applications involving the imbalanced class distribution problem, our approach SBC has the best performance on predicting the minority class samples. Moreover, SBC takes less time to select the training samples than the approaches NearMiss-2, SBCNM-1, SBCNM-2, SBCNM-3, SBCMD, and SBCMF.

[Fig. 2. The experimental results on the Overdue Detection Database: the precision rate, recall rate, and F-measure of the minority class for each method.]

[Fig. 3. The execution time on the Overdue Detection Database for each method. Bar chart; y-axis: execution time (min.), 0 to 200; x-axis: SBC, RT, AT, NearMiss-2, SBCNM-1, SBCNM-2, SBCNM-3, SBCMD, SBCMF.]

5 Conclusion

In classification tasks, the effect of the imbalanced class distribution problem is often ignored. Many studies [3, 7] have focused on improving classification accuracy without considering the imbalanced class distribution problem. Hence, the classifiers constructed in these studies lose the ability to predict the correct decision class for the minority class samples in datasets where the number of majority class samples is much greater than the number of minority class samples.

Many real applications, like rare-disease investigation, credit card fraud detection, and internet intrusion detection, involve the imbalanced class distribution problem, in which it is hard to make right predictions on the customers or patients we are interested in. In this study, we propose cluster-based under-sampling approaches to solve the imbalanced class distribution problem, using the backpropagation neural network as the classification technique. Two other under-sampling methods, random selection (RT) and NearMiss-2, are compared with our approaches in our performance studies. In the experiments, our approach SBC has better prediction accuracy and stability than the other methods: it not only attains high classification accuracy on the minority class samples but also has a fast execution time. However, SBCNM-1, SBCNM-2, SBCNM-3, and SBCMF do not have stable performances in our experiments, and these four methods also take more time than SBC to select the majority class samples.

References

1. Chawla, N. V.: C4.5 and Imbalanced Data Sets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure. Proceedings of the ICML'03 Workshop on Class Imbalances (2003)
2. Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P.: SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research 16 (2002) 321-357
3. Caragea, D., Cook, D., Honavar, V.: Gaining Insights into Support Vector Machine Pattern Classifiers Using Projection-Based Tour Methods. Proceedings of the KDD Conference, San Francisco, CA (2001) 251-256
4. Chawla, N. V., Lazarevic, A., Hall, L. O., Bowyer, K. W.: SMOTEBoost: Improving Prediction of the Minority Class in Boosting. Proceedings of the Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, Dubrovnik, Croatia (2003) 107-119
5. Clark, P., Niblett, T.: The CN2 Induction Algorithm. Machine Learning 3 (1989) 261-283
6. Drummond, C., Holte, R. C.: C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling. Proceedings of the ICML'03 Workshop on Learning from Imbalanced Datasets (2003)
7. Del-Hoyo, R., Buldain, D., Marco, A.: Supervised Classification with Associative SOM. Lecture Notes in Computer Science 2686 (2003) 334-341
8. Japkowicz, N.: Concept-Learning in the Presence of Between-Class and Within-Class Imbalances. Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence (2001) 67-77
9. Zhang, J., Mani, I.: KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets (2003)
10. Chyi, Y. M.: Classification Analysis Techniques for Skewed Class Distribution Problems. Master Thesis, Department of Information Management, National Sun Yat-Sen University (2003)