BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

TZU-CHENG CHUANG
School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907

SAUL B. GELFAND
School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907

OKAN K. ERSOY
School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907

ABSTRACT

It is common to train a classifier with a training set, and to test it with a testing set to study the classification accuracy. In this paper, we show how to effectively use a number of validation sets obtained from the original training data to improve the performance of a classifier. The proposed validation boosting algorithm is illustrated with a support vector machine (SVM) in the application of lymphography classification. A number of runs of the algorithm are generated to show its robustness as well as to generate consensus results. At each run, a number of validation datasets are generated by randomly picking a portion of the original training dataset. At each iteration during a run, the trained classifier is used to classify the current validation dataset. The misclassified validation vectors are added to the training set for the next iteration. Every time the training set is changed, new classification borders are generated with the classifier used. Experimental results on a lymphography dataset show that the proposed method with validation boosting can achieve much better generalization performance on a testing set than the case without validation boosting.

INTRODUCTION

Machine learning has been used in cancer prediction and prognosis for nearly 20 years (Cruz and Wishart 2006). There are several methods widely used for this purpose, such as decision tree, Naïve Bayes, k-nearest neighbor, neural network and support vector machine algorithms. New algorithms are still being developed to improve the classification accuracy. One approach is to utilize feature extraction to select fewer features to train the classifier.
Bagging and boosting techniques that generate different training samples are also utilized for this purpose. In this way, a number of different classifiers can be generated, and consensus techniques such as majority voting and least squares estimation-based weighting (Kim 2003) can be used to achieve better and more stable classification accuracy. In bagging (Breiman 1996), several classifiers are trained independently via a bootstrap method, and their results are combined together to obtain the final decision. In this procedure, a single training set TR = {(x_i, y_i), i = 1, 2, ..., n} is used to generate K different classifiers. In order to get K different training sets and make them independent of each other, the original training set is resampled. The
new K training sets have the same size as the original dataset, but some instances may appear more than once, and some instances may not be in a new resampled training set. The AdaBoost algorithm by Freund and Schapire (1994, 1996, 1997) is generally considered a first step towards more practical boosting algorithms. A boosting algorithm defines different distributions over the training samples, and uses a weak learner to generate hypotheses with respect to the generated distributions. From the different distributions of training samples, different classifiers are generated, and they are next combined with different weights to get the final results. Although the resampling technique of our proposed method is similar to bagging and boosting, our approach is different since we utilize a number of validation sets obtained from the training set, and these are used to modify the training sets. Initially, we divide the original training data into two groups, one for training and the other for validation. We use the training portion to train the classifier, and then we validate it with the validation set. The misclassified validation samples are added to the current training set to generate the next training set. At each iteration, the current validation set is regenerated as a randomly chosen part of the original training dataset with a fixed percentage. At each run, the procedure is repeated over several iterations until the validation accuracy reaches its maximum. At this point, a classifier is generated. Due to the random initialization of the training set and the validation set at each run, different independent classifiers are obtained with a number of runs. The results from different runs can be combined by a consensus rule such as majority voting to get the final results.

DATASET

Lymphography data is obtained from the UCI machine learning repository (Kononenko and Cestnik 1988). The examples in this data set use 18 attributes, with four possible final diagnostic classes. The attributes include lymph node dimension, number of nodes, types of lymphatics, etc.
For convenient representation, the attributes are transformed to integer type. There are a total of 148 samples: 2 are normal, 81 are metastases, 61 are malignant lymph and 4 are fibrosis. Because normal and fibrosis cases are scarce compared to the other two cases, we used 142 samples to classify whether a sample is metastases or malignant lymph.

SUPPORT VECTOR MACHINES

Vapnik introduced SVMs with a kernel function in the 1990s (Vapnik 1992). This algorithm is initially designed for the two-class classification problem. One class output is marked as 1, and the other class output is marked as -1. The algorithm tries to find the best separating hyperplane with the largest margin width. By getting a better hyperplane from the training samples, it is expected to get better testing accuracy. In SVM, the hyperplane of the nonseparable case is determined by solving the following optimization problem:
    min_{w,b,ξ}  (1/2) ||w||^2 + C Σ_i ξ_i
    subject to   y_i (x_i^T w + b) ≥ 1 − ξ_i,   ξ_i ≥ 0                  (1)

where x_i is the i-th data vector, y_i is the binary (-1 or 1) class label of the i-th data vector, ξ_i is the slack variable, w is the weight vector normal to the hyperplane, C is the regularization parameter, and b is the bias. It can be shown that the margin width is equal to 2/||w||. Usually the original data is mapped by using a kernel function to a higher dimensional representation before classification. Some common kernel functions are linear, polynomial, radial basis and sigmoid functions. In our case, we used the radial basis function given by

    K(x_i, x_j) = C * exp(−γ * ||x_i − x_j||^2)                          (2)

In the experiments conducted, the SVM-Light (Joachims, 2004) software was utilized. We picked γ equal to 1 and C equal to 1 in these experiments.

TRAINING AND VALIDATION RESAMPLING TECHNIQUE

In the training phase, we initially decide the percentage of the training set as p_train and the percentage of the validation set as p_val. We divide the original training set into two groups according to p_train and p_val. These two initial training and validation sets do not overlap with each other. The training set is used to train the classifier, which is next validated with the validation set (Figure 1). Then, the misclassified validation samples are included in the training set to generate the next iteration's training set. In the next iteration, the validation set is randomly picked from the original complete training set with percentage p_val. With the new training set and validation set, other misclassified validation samples are generated and included in the training set to generate the next iteration's training set. After several iterations, the performance of the classifier trained in this way becomes better than that of a classifier trained on all the original training set without any validation set. The iterations are stopped after reaching nearly 100% validation accuracy.

Figure 1. The misclassified validation samples are added to the training samples of the previous stage.
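The resampling procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper used SVM-Light, while this sketch substitutes scikit-learn's SVC with the same RBF kernel settings (C = 1, γ = 1); the function name validation_boost and its defaults are ours.

```python
import numpy as np
from sklearn.svm import SVC

def validation_boost(X, y, p_train=0.5, p_val=0.5, max_iters=5, seed=0):
    """One run of validation boosting: train on a subset of the data,
    then repeatedly add misclassified validation vectors to the training
    set, stopping once validation accuracy reaches 100% (or max_iters)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_tr, n_val = int(p_train * n), int(p_val * n)
    perm = rng.permutation(n)
    train_idx = list(perm[:n_tr])           # initial training portion
    val_idx = perm[n_tr:n_tr + n_val]       # initial validation portion (disjoint)
    clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
    for _ in range(max_iters):
        clf.fit(X[train_idx], y[train_idx])
        wrong = val_idx[clf.predict(X[val_idx]) != y[val_idx]]
        if len(wrong) == 0:                 # ~100% validation accuracy: stop
            break
        train_idx.extend(wrong.tolist())    # emphasize misclassified samples
        # next validation set: a random p_val fraction of the original training data
        val_idx = rng.choice(n, size=n_val, replace=False)
    return clf
```

Note that only the first validation set is disjoint from the training portion; later validation sets are drawn from the full original training data, as described above.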
The proposed method emphasizes misclassified validation samples. If a misclassified sample is still misclassified the next time, it is re-emphasized, resulting in the following weighting:

    w_{t+1}(x) = { 2 w_t(x),   misclassified(x)
                 { w_t(x),     correctly classified(x)                   (3)

where

    type(x) = { 1,  x belongs to type 1
              { 0,  x belongs to the other type                          (4)

Due to the random initializations of the training and validation sets, we get a different classifier at each run. In order to get better results, we can use a consensus rule such as majority voting between these classifiers.

EXPERIMENTS

We initially picked 50% of all the data for training and the other 50% for testing. In the training phase, we chose p_train equal to 0.5 and p_val equal to 0.5. The results of four runs with different training and testing data are shown in Table 1. From Table 1, we can see that when the validation accuracy nearly reaches 100%, the testing accuracy also reaches its maximum. Because 100% validation accuracy means there are no misclassified validation samples, the iteration process is stopped after nearly reaching this value.

Table 1. Comparison of the testing classification accuracy between the classifier trained on all training data and the classifier trained by the proposed method. TestByAll is the testing accuracy of the classifier trained on all training data. Valid. is the validation accuracy at that iteration. Test. is the testing accuracy at that iteration.

TestByAll        0.5634           0.57746          0.59155          0.5493
Iteration   Valid.  Test.    Valid.  Test.    Valid.  Test.    Valid.  Test.
1           0.5714  0.5634   0.6286  0.5775   0.5143  0.5775   0.6000  0.5493
2           0.7143  0.4366   0.7714  0.4507   0.8286  0.4225   0.6571  0.4789
3           0.9714  0.7183   0.8571  0.4789   0.8286  0.5493   1.0000  0.5916
4           1.0000  0.6620   1.0000  0.6620   1.0000  0.6197   1.0000  0.5916
5           1.0000  0.6620   1.0000  0.6620   1.0000  0.6197   1.0000  0.5916

To test whether our proposed method significantly improves the accuracy, we can convert the decimal values to percentile values and then calculate the Chi-square statistic. Picking α = 0.05, if the Chi-square value is larger than 3.841, we can say that our proposed method is significantly different.
    χ² = Σ_{i=1}^{k} (x_i − E_i)² / E_i                                  (5)

where x_i is the percentile value of the testing accuracy from our proposed method, and E_i is the percentile value of the testing accuracy from the classifier trained using all training data. Due to space limitations, we only show 4 cases. After taking more runs, we can see that the difference is significant. Figure 2 shows that the testing accuracy first drops, and then boosts to higher than the initial value.
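Equation (5) is a direct sum over the k runs and can be computed as follows; the function name chi_square is ours, and the example call simply plugs in the iteration-5 and TestByAll accuracies from Table 1 converted to percentile values.

```python
def chi_square(observed, expected):
    """Equation (5): sum of (x_i - E_i)^2 / E_i over the k runs.
    Both arguments are lists of percentile (0-100) testing accuracies."""
    return sum((x - e) ** 2 / e for x, e in zip(observed, expected))

# Illustration with the Table 1 values (iteration 5 vs. TestByAll), as percentiles:
stat = chi_square([66.20, 66.20, 61.97, 59.16],
                  [56.34, 57.746, 59.155, 54.93])
```

The statistic is then compared against the α = 0.05 critical value to decide significance.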
Figure 2. The testing classification accuracy varies with iterations.

In order to test for consensus results, we fixed the same training data and testing data for 3 classifiers, and then used majority voting to combine the different classifier results. The results are shown in Tables 2 and 3.

Table 2. Combining 3 different classifiers by majority voting. For this training set and testing set, if we use all the training data to train the classifier, the testing accuracy is 0.5493. The three classifiers are generated from different initializations of the training set and the validation set.

Classifier        1                 2                 3           Consensus
Iteration   Valid.   Test.    Valid.   Test.    Valid.  Test.       Test.
1           0.54286  0.5493   0.57143  0.5493   0.6286  0.5493      0.5493
2           0.74286  0.46479  0.74286  0.46479  0.6857  0.4507      0.4507
3           1.00000  0.60563  1.00000  0.60563  1.0000  0.60563     0.6056
4           0.97143  0.60563  1.00000  0.60563  1.0000  0.60563     0.6056

Table 3. Combining 3 different classifiers by majority voting. For this training set and testing set, if we use all the training data to train the classifier, the testing accuracy is 0.5634.

Classifier        1                 2                 3           Consensus
Iteration   Valid.  Test.     Valid.  Test.     Valid.  Test.       Test.
1           0.6000  0.5634    0.3143  0.4648    0.5429  0.5634      0.5634
2           0.6571  0.4507    0.7714  0.5634    0.7143  0.4507      0.4648
3           1.0000  0.6479    0.9429  0.5775    1.0000  0.6620      0.6479
4           1.0000  0.6479    0.9714  0.5775    1.0000  0.6620      0.6479

DISCUSSION AND CONCLUSIONS

From the results of the experiments, it is apparent that the resampling technique does generate a better training set to train the classifier, resulting in better classification accuracy. Another approach would be to use all the training data to train the classifier, and then use the classifier to search for the misclassified vectors. However, it is then likely to get 100% training accuracy, meaning we would not know which samples to emphasize. Including validation sets works better for this reason.
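The majority-voting consensus used in Tables 2 and 3 can be sketched as follows; the function name majority_vote is ours, and the example labels are illustrative, not taken from the experiments.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions by majority voting.
    predictions: a list of K lists, each holding the K-th classifier's
    labels for the same n test samples.  Returns the n consensus labels."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Three classifiers, three test samples, labels in {1, -1}:
consensus = majority_vote([[1, 1, -1],
                           [1, -1, -1],
                           [-1, 1, -1]])   # -> [1, 1, -1]
```

With an odd number of binary classifiers, as here, ties cannot occur.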
In some cases, we noticed that the testing accuracy was lower in iteration 2 or 3. However, the testing results always improved when we reached 100% validation accuracy in succeeding iterations. In previous boosting methods, it is possible to overfit by running too many rounds. With our approach, we only add the misclassified validation samples to the training set; when 100% validation accuracy is reached, there are no more rounds to be run. We also noticed that if the validation accuracy in the first iteration is not sufficiently high, such as better than 50%, it is a good idea to regenerate the initialization of the training and validation sets. This approach reduces the number of iterations needed to get the best training set. We also considered rates of convergence. In all the experiments, the maximum validation accuracy is always reached within 5 iterations. This may take extra computation time compared to using all the training set to train one classifier, but the number of iterations needed to get better results is not excessive. By using random initialization of the training set and the validation set, we can generate a number of different classifiers. The results from these classifiers can be combined, for example, by using majority voting to achieve better results. However, in our consensus experiments, the results did not improve further. This topic needs to be investigated further. We only generated 3 classifiers and then aggregated the results. It is possible that more classifiers would increase performance. In the experiments, we chose p_train equal to 0.5 and p_val equal to 0.5. Estimating the optimal values of these parameters requires further research.

ACKNOWLEDGEMENT

This research was supported by NSF Grant MCB-9873139 and partly by NSF Grant #0325544.

REFERENCES

Joseph A. Cruz and David S. Wishart, 2006, Applications of Machine Learning in Cancer Prediction and Prognosis, Cancer Informatics.

Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, Sung Yang Bang, 2003, Constructing support vector machine ensemble, Pattern Recognition 36, pp. 2757-2767.
Leo Breiman, 1996, Bagging Predictors, Machine Learning 24 (2), pp. 123-140.

Y. Freund and R. E. Schapire, 1994, A decision-theoretic generalization of on-line learning and an application to boosting, in EuroCOLT: European Conference on Computational Learning Theory, LNCS.

Y. Freund and R. E. Schapire, 1996, Experiments with a new boosting algorithm, in Proceedings of the 13th International Conference on Machine Learning, pp. 146-148, Morgan Kaufmann.

Y. Freund and R. E. Schapire, 1997, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55(1):119-139.

Igor Kononenko and Bojan Cestnik, 1988, Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.

B. E. Boser, I. M. Guyon, and V. N. Vapnik, 1992, A training algorithm for optimal margin classifiers, in D. Haussler, editor, 5th Annual ACM Workshop on COLT, pp. 144-152, Pittsburgh, PA, ACM Press.

Thorsten Joachims, 2004, SVM-Light, http://www.cs.cornell.edu/people/tj/svm_light/.