Proceedings of the 2011 International Conference on Machine Learning and Cybernetics, Guilin, 10-13 July, 2011

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY

JUN-HAI ZHAI, NA LI, MENG-YAO ZHAI

Key Lab. of Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei University, Baoding 071002, China
E-MAIL: mczhai@hbu.cn

Abstract:
The fuzzy k-nearest neighbor (F-KNN) algorithm was originally developed by Keller in 1985. It generalizes the k-nearest neighbor (KNN) algorithm and overcomes the drawback of KNN that all instances are considered equally important. However, like KNN, F-KNN still suffers from a large memory requirement. To deal with this problem, this paper proposes the condensed fuzzy k-nearest neighbor rule (CFKNN), which selects the important instances based on sample fuzzy entropy. The experimental results show that the proposed method is feasible and effective.

Keywords:
Nearest neighbor; Condensed nearest neighbor; Fuzzy nearest neighbor; Instance selection; Sample fuzzy entropy

1. Introduction

The nearest neighbor rule (NN) was originally proposed by Cover and Hart [1] and is widely used in many fields such as pattern recognition [2], data mining [3], and machine learning [4-7]. The rule owes its popularity to its conceptual simplicity and ease of understanding. However, the nearest neighbor rule suffers from the following three problems: (1) To classify a test instance x, all instances in the training set must be stored, i.e. the space complexity is O(n), where n is the number of instances in the training set; (2) The distances between x and all instances in the training set must be computed, i.e. the time complexity is also O(n); (3) Each training sample is given equal importance in classifying an unseen sample, regardless of its different contribution to classification. A great deal of research has been devoted to these problems.
Hart proposed the condensed nearest neighbor rule (CNN) to deal with problems (1) and (2) [8]. Keller proposed the F-KNN algorithm to deal with problem (3) [9]. However, F-KNN still suffers from problems (1) and (2). In this paper, we propose the condensed fuzzy k-nearest neighbor rule (CFKNN) based on sample fuzzy entropy. The experimental results show that the proposed method is feasible and effective.

The paper is organized as follows. Preliminaries on the proposed method are given in Section 2. The CFKNN is presented in Section 3, experimental results and analysis are provided in Section 4, and Section 5 concludes the paper.

2. Preliminaries

In this section, we introduce several basic concepts and algorithms related to our method, mainly including the concept of a decision table, fuzzy entropy, the algorithm used for determining the fuzzy membership degrees of the instances in the training set, and the F-KNN algorithm.

Definition 1. A decision table (DT for short) is a 2-tuple DT = (U, A ∪ C), where U = {x_1, x_2, ..., x_N} is a non-empty finite set of objects (instances) called the training set and A is a set of real-valued conditional attributes. C is the decision attribute; without loss of generality we suppose that the instances in U are classified into k categories C_1, C_2, ..., C_k.

Definition 2. Given a decision table DT = (U, A ∪ C), x ∈ U, and C_i (1 ≤ i ≤ k), let \mu_i(x) be the fuzzy membership degree of instance x in class C_i. The fuzzy entropy of instance x is defined as follows:

Entr(x) = -\sum_{i=1}^{k} \mu_i(x) \log_2 \mu_i(x)    (1)

978-1-4577-0308-9/11/$26.00 ©2011 IEEE
where \mu_i(x) denotes the fuzzy membership degree of instance x in class C_i. In this paper, we use the following algorithm to determine the fuzzy membership degrees of the instances in the training set U [10].

Algorithm 1
Input: A DT with real-valued conditional attributes.
Output: The fuzzy membership degrees of the instances.
STEP 1: For each class C_i, calculate the center c_i of class C_i;
STEP 2: For each instance x ∈ U and each class C_i (1 ≤ i ≤ k), calculate the distance d_i(x) between x and c_i;
STEP 3: For each instance x ∈ U, calculate the fuzzy membership degree \mu_i(x) as follows:

\mu_i(x) = (1/d_i^2(x)) / \sum_{j=1}^{k} (1/d_j^2(x))    (2)

For convenience, we list the F-KNN algorithm as follows.

Algorithm 2
Input: A DT with real-valued conditional attributes, and a test instance x.
Output: The class membership degrees of x given by the fuzzy k-nearest neighbor rule.
STEP 1: Initialize K = K_0;
STEP 2: Find the K_0 nearest neighbors of x in DT:
STEP 2.1: Initialize j = 1;
STEP 2.2: DO
STEP 2.3: Compute the distance from x to x_j;
STEP 2.4: IF (j ≤ K_0) THEN
    Include x_j in the set of K_0 nearest neighbors;
ELSE IF (x_j is closer to x than any previous nearest neighbor) THEN
    Delete the farthest of the K_0 nearest neighbors;
    Include x_j in the set of K_0 nearest neighbors;
STEP 3: Compute \mu_i(x) using formulas (2) and (3):

\mu_i(x) = \sum_{j=1}^{K_0} \mu_{ij} \|x - x_j\|^{-2/(m-1)} / \sum_{j=1}^{K_0} \|x - x_j\|^{-2/(m-1)}    (3)

where \mu_{ij} is the membership degree of the j-th nearest neighbor in class C_i.

3. The Condensed Fuzzy K-Nearest Neighbor Rule Based on Sample Fuzzy Entropy

In this section, we present our CFKNN method. In KNN and F-KNN, high computational complexity is unavoidable because all instances in the training set must be stored. In fact, for a given fuzzy information system, different instances in the training set have different degrees of importance and make different contributions to classification. Some instances may be more important than others. In our method we select the set of important samples based on the fuzzy entropy of the samples in the training set.
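Under our reading of formulas (2) and (3), the two preliminary computations of Section 2 can be sketched in Python as follows. This is an illustrative NumPy version, not the authors' implementation: the function names and the small epsilon guarding zero distances are our own additions.

```python
import numpy as np

def class_center_memberships(X, y, classes):
    """Formula (2): mu_i(x) = (1/d_i(x)^2) / sum_j (1/d_j(x)^2),
    where d_i(x) is the distance from x to the center of class C_i.
    Returns one membership vector (rows summing to 1) per instance."""
    centers = np.array([X[y == c].mean(axis=0) for c in classes])
    # Squared distances from every instance to every class center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    inv = 1.0 / np.maximum(d2, 1e-12)  # epsilon guard: x may sit on a center
    return inv / inv.sum(axis=1, keepdims=True)

def fknn_memberships(x, X_train, U_train, k=5, m=2.0):
    """Formula (3): distance-weighted vote of the k nearest neighbors
    with weights ||x - x_j||^(-2/(m-1)); U_train holds the neighbors'
    class membership vectors (e.g. from class_center_memberships)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[nn], 1e-12) ** (2.0 / (m - 1.0))
    return (w[:, None] * U_train[nn]).sum(axis=0) / w.sum()
```

With m = 2 the weights reduce to inverse squared distances, which is the common choice in Keller's F-KNN.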
The bigger the fuzzy entropy of a sample is, the more important the sample, because samples with bigger fuzzy entropy provide more information for classification and are closer to the class boundaries. Hence the set of important samples usually carries the same classification information as the original dataset. If the important sample set is used as the training set to classify an unseen test sample, the efficiency of classification can be increased, and the computational complexity can be decreased.

In the following, we provide the CFKNN algorithm of our proposed method.

Algorithm 3
Input: A DT, parameters K and α (suppose |DT| = n, and the samples in DT are classified into k classes).
Output: S ⊆ DT.
STEP 1: For each instance x ∈ DT, determine the fuzzy membership degree of x with (2);
STEP 2: Randomly select one instance belonging to each class from the training set, and put the selected samples into S;
STEP 3: Repeat the following process for each x remaining in DT:
STEP 3.1: Find the K nearest neighbors of x in S;
STEP 3.2: Determine the class membership degrees of x with (3): (\mu(x, C_1), \mu(x, C_2), ..., \mu(x, C_k));
STEP 3.3: Compute the fuzzy entropy of instance x
with (1): Entr(x) = -\sum_{i=1}^{k} \mu_i(x) \log_2 \mu_i(x);
STEP 3.4: If Entr(x) > α then S = S ∪ {x}; else discard x;
STEP 4: Return S.

4. Experimental results

The effectiveness of the proposed method is demonstrated through numerical experiments in the environment of Matlab 7.0 on a Pentium 4 PC. In our experiments we selected 9 datasets in total, including 7 UCI datasets [11] and 2 real-world datasets [12]. The 7 UCI datasets are the Iris, Breast Cancer (WDBC), Breast Cancer (WPBC), Glass, Image Segmentation, Parkinsons, and Pima datasets. The 2 real-world datasets are the CT Image dataset and the RenRu dataset. The CT Image dataset was obtained by collecting 122 medical CT images from a Baoding local hospital. All instances, with 35 numerical attributes, are classified into 2 classes (i.e., a normal class and an abnormal class). The RenRu dataset was created by the Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province, China. It was obtained by collecting 148 Chinese characters REN and RU with different typefaces, fonts and sizes, among which there are 92 Chinese characters REN and 56 Chinese characters RU. Each Chinese character is described by 26 numerical features. The basic information of the 9 datasets is listed in Table 1.

In the experiment, we set K = 5 and randomly selected 70% of the data in each dataset as the training set, with the other 30% as the testing set. For each dataset, we ran 10-fold cross-validation ten times; the experimental results are the averages of the 10 outputs and are listed in Table 2. The experimental results demonstrate that the proposed method is effective and efficient.

In the experiment, we also explored the relation between the value of α and the testing accuracy of CFKNN. We changed the parameter α from 0.5 to 1.0 in steps of 0.05. With α set to different values we recorded the classification accuracies of CFKNN; the resulting curves are shown in Figure 1. From the curves, we can see that the value of α does affect the classification result.
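The condensing procedure of Algorithm 3 in Section 3 can be sketched end to end as follows. This is a runnable illustration under stated assumptions, not the authors' Matlab code: U is assumed to hold the per-instance membership vectors from formula (2), the rng seed parameter is our addition, and zero distances are guarded with a small epsilon.

```python
import numpy as np

def cfknn_select(X, y, U, k=5, alpha=0.75, m=2.0, rng=None):
    """Sketch of Algorithm 3 (CFKNN): seed S with one random instance
    per class, then scan the remaining instances; an instance whose
    F-KNN membership vector (computed against the current S) has fuzzy
    entropy above alpha is kept, otherwise it is discarded."""
    rng = np.random.default_rng(rng)
    classes = np.unique(y)
    # STEP 2: one randomly chosen seed instance per class.
    S = [int(rng.choice(np.flatnonzero(y == c))) for c in classes]
    rest = [i for i in range(len(X)) if i not in S]
    for i in rest:
        # STEPS 3.1-3.2: F-KNN membership of x_i w.r.t. the current S.
        d = np.linalg.norm(X[S] - X[i], axis=1)
        nn = np.argsort(d)[: min(k, len(S))]
        w = 1.0 / np.maximum(d[nn], 1e-12) ** (2.0 / (m - 1.0))
        mu = (w[:, None] * U[np.asarray(S)[nn]]).sum(axis=0) / w.sum()
        # STEPS 3.3-3.4: keep the instance iff its entropy exceeds alpha.
        ent = float(-np.sum(mu[mu > 0] * np.log2(mu[mu > 0])))
        if ent > alpha:
            S.append(i)
    return np.array(S)
```

After condensing, an unseen sample would be classified by running F-KNN (Algorithm 2) against S instead of the full training set, which is where the memory and time savings come from.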
For the Iris dataset, it is appropriate for α to take values in the interval [0.5, 0.75]. For the WDBC and WPBC datasets, the appropriate intervals are [0.5, 0.7] and [0.5, 0.85] respectively. For the other datasets, it is appropriate for α to take values in the interval [0.5, 0.95].

In addition, we studied the relationship between the value of α and the number of instances selected by CFKNN. The curves describing this relationship are shown in Figure 2. From the curves, we observe that the value of α does affect the number of instances selected by CFKNN: as α increases, the number of selected instances decreases continually, and for α > 0.5 most of the curves become smoother and smoother, except for WPBC. So the number of instances selected by CFKNN changes little as α increases beyond 0.5. Considering the testing accuracy, it is reasonable for α to take different values between 0.5 and 0.95 for different datasets.

Figure 1. The average accuracy of CFKNN on the 9 datasets with different thresholds

Figure 2. The number of instances selected by CFKNN on the 9 datasets with different thresholds
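The monotone behavior described above follows directly from the selection rule: raising the threshold α can only shrink the set of instances whose entropy exceeds it. A small self-contained illustration (the membership matrix below is toy data standing in for the F-KNN memberships of formula (3), not values from the paper):

```python
import numpy as np

def entropy(mu):
    """Fuzzy entropy of one membership vector, formula (1)."""
    mu = mu[mu > 0]
    return float(-(mu * np.log2(mu)).sum())

# Toy membership vectors (rows sum to 1); real values come from the data.
M = np.array([[0.95, 0.05], [0.80, 0.20], [0.60, 0.40],
              [0.50, 0.50], [0.30, 0.70], [0.10, 0.90]])

# Count how many instances clear each threshold, as in Figure 2's sweep.
alphas = np.arange(0.5, 1.01, 0.05)
counts = [sum(entropy(mu) > a for mu in M) for a in alphas]
```

The counts can only decrease as α grows, mirroring the curves of Figure 2; the accuracy trade-off, by contrast, is dataset-dependent and must be read off Figure 1.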
5. Conclusions

In this paper, in order to overcome the drawback of the large computational complexity of F-KNN, we propose the condensed fuzzy k-nearest neighbor rule (CFKNN) based on the fuzzy entropy of instances. The experimental results show that the proposed method is feasible and effective.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (60903088, 60903089), by the Natural Science Foundation of Hebei Province (F200000323, F2020063), by the Key Scientific Research Foundation of the Education Department of Hebei Province (ZD20039), by the Scientific Research Foundation of the Education Department of Hebei Province (200932, 200940), and by the Undergraduate Science and Technology Innovation Projects of Hebei University (20043).

References

[1] T. Cover, P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967, 13(1): 21-27.
[2] B. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, 1991.
[3] X. Wu, V. Kumar, J. R. Quinlan et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 2008, 14(1): 1-37.
[4] T. M. Mitchell. Machine Learning. McGraw-Hill Companies, Inc., 2003.
[5] K. Small, D. Roth. Margin-based active learning for structured predictions. International Journal of Machine Learning and Cybernetics, 2010, 1(1-4): 3-25.
[6] L. Wang. An improved multiple fuzzy NNC system based on mutual information and fuzzy integral. International Journal of Machine Learning and Cybernetics, 2011, 2(1): 25-36.
[7] Z. Liu, Q. Wu, Y. Zhang et al. Adaptive least squares support vector machines filter for hand tremor canceling in microsurgery. International Journal of Machine Learning and Cybernetics, 2011, 2(1): 37-47.
[8] P. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 1968, 14(3): 515-516.
[9] J. M. Keller, M. R. Gray, J. A. Givens. A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, 1985, 15(4): 580-585.
[10] J. H. Zhai.
Fuzzy decision tree based on fuzzy-rough technique. Soft Computing, 2010, DOI: 10.1007/s00500-010-0584-0.
[11] C. L. Blake, C. J. Merz. UCI Repository of machine learning databases. 1996, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[12] X. Z. Wang, J. H. Zhai, S. X. Lu. Induction of multiple fuzzy decision trees based on rough set technique. Information Sciences, 2008, 178(16): 3188-3202.
Table 1. The basic information of the 9 datasets used in our experiments

Dataset      Number of instances   Number of attributes   Number of classes
Iris         150                   4                      3
WDBC         555                   30                     2
WPBC         191                   33                     2
Glass        160                   9                      6
Image        194                   19                     7
Parkinsons   195                   22                     2
Pima         768                   8                      2
CT Image     122                   35                     2
RenRu        148                   26                     2

Table 2. Experimental results with K=5

Dataset      α      Number of selected instances   Average accuracy   CPU time (s)
Iris         0.65   50                             0.96               0.026
WDBC         0.55   74                             0.92               0.284
WPBC         0.80   90                             0.68               0.083
Glass        0.90   13                             0.68               0.0580
Image        0.90   103                            0.80               0.056
Parkinsons   0.95   68                             0.75               0.082
Pima         0.75   377                            0.69               0.4828
CT Image     0.90   83                             0.84               0.049
RenRu        0.95   59                             0.81               0.0256