An algorithm for correcting mislabeled data


Intelligent Data Analysis 5 (2001) 491-502
IOS Press

Xinchuan Zeng and Tony R. Martinez
Computer Science Department, Brigham Young University, Provo, UT 84602, USA
E-mail: {zengx,martinez}@cs.byu.edu

Received 12 June 2001; Revised 22 July 2001; Accepted 25 August 2001

Abstract. Reliable evaluation of the performance of classifiers depends on the quality of the data sets on which they are tested. During the collection and recording of a data set, however, some noise may be introduced into the data, especially in various real-world environments, which can degrade the quality of the data set. In this paper, we present a novel approach, called ADE (automatic data enhancement), for correcting mislabeled data in a data set. In addition to using multi-layer neural networks trained by backpropagation as the basic framework, ADE assigns each training pattern a class probability vector as its class label, in which each component represents the probability of the corresponding class. During training, ADE constantly updates the probability vector based on its difference from the output of the network. With this updating rule, the probability of a mislabeled class gradually becomes smaller while that of the correct class becomes larger, which eventually corrects mislabeled data after a number of training epochs. We have tested ADE on a number of data sets drawn from the UCI data repository for nearest neighbor classifiers. The results show that for most data sets, when mislabeled data is present, a classifier constructed using a training set corrected by ADE achieves significantly higher accuracy than one constructed without using ADE.

Keywords: Neural networks, backpropagation, probability labeling, mislabeled data, data correction

1. Introduction

In the fields of machine learning, neural networks and pattern recognition, a typical approach to evaluating the performance of classifiers is to test them on real-world data sets (such as those from the UCI machine learning data repository). Clearly, the quality of the data sets affects the reliability of the evaluations. In the process of collecting and recording data in the real world, however, some noise may be introduced into data sets due to various sources of error. The inclusion of noise in data sets consequently degrades the quality of the evaluation of the classifiers being tested.

This issue has been addressed previously using various approaches in several research areas, especially in instance-based learning, whose performance is particularly sensitive to noise in the training data. To eliminate noise in a training set, Wilson used a 3-NN (nearest neighbor) classifier as a filter (or preprocessor) to eliminate the instances misclassified by the 3-NN, and then applied 1-NN on the filtered data as the classifier [7]. Several versions of edited nearest neighbor algorithms [5-7,9] save only selected instances for generalization in order to reduce storage while still maintaining similar accuracy. The algorithm proposed by Aha et al. [1,2] removes noise and reduces storage by retaining only those instances that have good classification records when applied as nearest neighbors. Wilson and Martinez [18,19] proposed several instance-pruning techniques which are capable of removing noise and reducing storage requirements.

The idea of using selected instances of the training data has also been applied to other classifiers. In an approach proposed by John [12], the training data is first filtered by removing those instances pruned by a C4.5 tree [15], and a new tree is then constructed using the filtered data. Gamberger et al. [8] proposed a noise detection and elimination method based on compression measures and the Minimum Description Length principle. Brodley and Friedl [4] applied an ensemble of classifiers as a filter to identify and eliminate mislabeled training data. Teng [16,17] applied a procedure to identify and correct noise in classes and attributes, based on the predictions of C4.5 decision trees.

In this work we present a novel approach, called ADE (automatic data enhancement), to correct mislabeled instances in a data set. This approach is based on the mechanisms of neural networks trained by backpropagation. However, a distinct feature of this approach, in contrast to standard backpropagation, is that each pattern in the data set is assigned a class probability vector which is constantly updated (instead of being fixed) during training. The class label for each pattern is determined by its class probability vector and thus is also updated during training. Using this new mechanism, an initially mislabeled class can be corrected through gradual changes to its class probability vector.

The class probability vector is updated in such a way that it becomes closer to the output of the network. The output of the network is, however, determined by the architecture and weight settings of the network, which is the result of previous training using all patterns in the whole data set. If the initial mislabeled percentage is reasonably small, the network will be predominantly determined by the correctly labeled patterns. During training, the outputs of the network become more consistent with the class probability vectors of correctly labeled patterns and less consistent with those of incorrectly labeled patterns. The updating rule modifies the class probability vectors of mislabeled patterns by a larger amount, due to their higher inconsistency with the outputs compared to correctly labeled patterns. For a mislabeled pattern, the probability component of the mislabeled class becomes smaller while that of the correct class becomes larger. After a number of training epochs, the component of the correct class gradually increases to a level larger than that of the mislabeled class (which is initially the largest). At that point, the mislabeled class is modified to the correct class.

We have tested the performance of ADE on 24 data sets drawn from the UCI data repository using a nearest neighbor classifier. For each data set, we first mislabel a fraction of the training set, and then apply ADE to correct the mislabeled training data. We compare the test-set accuracies of two nearest neighbor classifiers, one using the training set with mislabeling and the other using the training set corrected by ADE. The stratified 10-fold cross-validation method is applied for estimating the accuracies. We conducted 20 stratified 10-fold cross-validations for each data set in order to achieve a reliable estimation. The results show that for most data sets, the classifiers using ADE for correction achieve significantly higher accuracies than those without ADE. Even when there is no mislabeled data, a classifier using ADE can achieve a higher accuracy for some data sets, showing the general utility of ADE as a data correcting algorithm.

2. Related work

Some approaches have been previously proposed to handle mislabeled data. Most of them focus on identifying mislabeled instances and then applying a filtering mechanism to remove them from the training set. Several early works on nearest neighbor classifiers [5-7,9] applied various methods to remove noise or combine instances, forming an edited or condensed data set. The edited set was then used to build classifiers for generalization. The main benefit of these approaches is a reduced storage requirement, while still maintaining accuracies similar to or only slightly lower than those obtained with the original data sets.

The algorithm IB3, one version of IBL (instance-based learning) proposed by Aha et al. [1,2], keeps track of the classification accuracy record of each instance in the original data set, and then retains only those instances whose record is better than a certain threshold. The retained data is then used to construct classifiers. They showed that IBL has the capability of removing noise and reducing storage as well. Wilson and Martinez [18,19] proposed several instance-pruning techniques which are noise-tolerant and capable of reducing the number of instances retained in memory while maintaining (and sometimes improving) generalization accuracy.

Gamberger et al. [8] presented a noise detection and elimination method for inductive learning. This method is based on compression measures and the Minimum Description Length principle, and eliminates noisy instances by finding a minimal example set. The method was applied to the CN2 rule induction algorithm on a disease diagnosis domain. The results showed increased accuracy after applying the noise elimination algorithm.

Brodley and Friedl [4] applied an ensemble of classifiers as a filter to identify and eliminate mislabeled training data. An ensemble of three classifiers (a 1-NN, a linear machine and a univariate decision tree) is applied to classify each data instance. An instance is identified as misclassified and removed if all three classifiers output the same class and that class is different from the original labeling. They evaluated their algorithm in an empirical study using a real-world data set consisting of land-cover maps of the Earth's surface, into which certain controlled fractions of mislabeling were introduced. The results demonstrated that classifiers constructed using the filtered data achieve a higher accuracy than those using the original set.

Teng [16,17] introduced a procedure, called polishing, to identify and correct noise both in classes and in attributes. In the first phase (prediction phase), 10-fold cross-validation is applied to partition the data into training and test sets. For each test set, a C4.5 decision tree classifier [15] is constructed from the training set. It is then applied to classify each instance in the test set, and its output is considered a predicted value and used as a reference for correction. To deal with noise in an attribute, the class is treated as an input and the attribute as the output. In the second phase (adjustment phase), the attributes of each misclassified instance are adjusted (replaced with the predicted values from the first phase) so that the instance can be correctly classified. If no combination of attribute value replacements is capable of correcting the classification, its class value is replaced with the predicted one (from the first phase). The procedure was tested on 12 data sets from the UCI machine learning data repository, showing the capability of identifying and correcting noise, and of improving the accuracy of classifiers through using corrected training data.

The ADE procedure presented in this work has the following distinct features compared to previous approaches for similar tasks. (i) In previous work [4,16,17], a data set was first divided into two disjoint sets: a training set and a test set. The noise in the test set was identified through predictions made by a classifier or an ensemble of classifiers constructed from the training set. However, the training set itself contains the same percentage of noise as the test set. A classifier constructed in such a way may not have good quality (especially when a high level of noise exists in the data set) and thus may not be able to make accurate predictions about the noise. In contrast, ADE includes all instances in the process and allows every instance to change its class, without relying on a pre-constructed classifier. (ii) By using a class probability instead of a binary class label, ADE allows a large number of hypotheses about class labelings to interact and compete with each other simultaneously, and lets them smoothly and incrementally converge to an optimal or near-optimal solution. This type of strategy has been shown to be efficient in searching a large solution space for NP-class optimization problems using relaxation-type neural networks [10]. (iii) Compared to other types of classifiers, multi-layer feed-forward networks offer a high capacity for fitting the target function; it has been shown that a network with one hidden layer has the capacity to approximate any function [11]. (iv) Both nominal and numerical attributes can be easily handled by the ADE procedure, in contrast to the restrictions on attribute types imposed by other procedures (for example, each attribute needs to be nominal when using the polishing procedure). (v) Compared to the strategy of removing noise [4,8], correcting mislabeled data is particularly useful for small data sets (for which data is sparse or data collection is costly) because every instance can be used as training data. In comparison, removing part of the data from an already sparse data set could significantly reduce the performance of a classifier trained on it.

3. Algorithm

Let S be an input data set in which some instances have been mislabeled. Our task is to find a procedure that corrects those mislabeled instances and outputs a corrected data set Ŝ. There are various domains in which pattern recognition techniques can be applied. Most of the domains in which we are interested possess the following two properties: (i) a data set contains some degree of regularity (instead of being totally random), which can be discovered and used to build a classifier capable of making predictions better than random guessing; (ii) when a reasonably small fraction of the data set is mislabeled, those regularities will still be maintained to a certain degree, although they may be weakened by the mislabeling.

Let α be the non-mislabeled fraction and β (= 1 - α) the mislabeled fraction of the input data set S. Let S^(c) be the correctly labeled set and S^(m) the mislabeled set (S^(c) + S^(m) = S). The instances in S^(c) have a tendency to strengthen the regularities possessed by S, while those in S^(m) have a tendency to weaken those regularities, due to the random nature of mislabeling. However, if the mislabeled fraction is small (i.e., β << α), the trend of maintaining the regularities due to S^(c) will be dominant. The strategy of ADE is to apply the regularities discovered in S^(c) to correct the mislabeled instances in S^(m).

We use a multi-layer perceptron as the underlying classifier to capture the regularities contained in S. The reason for this choice is that neural networks have demonstrated the capability of detecting and representing regularities (features) in a data set; neural networks with one hidden layer have the capability of approximating any function [11]. We use backpropagation as the training procedure for the network. The format and procedure adopted in our approach are the same as those of standard backpropagation networks except for the following differences. In the standard procedure, each instance v in S has the format:

v = (x, y)    (1)

where x = (x_1, x_2, ..., x_f) is the feature vector of v, f is the number of features, and y is the class (category) label of v.

In our approach, however, we attach a class probability vector to each instance v in S:

v = (x, y, p)    (2)

where p = (p_1, p_2, ..., p_c) is the class probability vector of v, and c is the number of class labels. For each class i, its probability p_i is proportional to the output V_i of the corresponding output node in the network, which is determined by the input U_i of that node through the sigmoid activation function:

V_i = (1/2)(1 + tanh(U_i / u_0))    (3)

where u_0 = 0.02 is the amplification parameter that reflects the steepness of the activation function. The sigmoid function has range [0.0, 1.0].

In addition to updating the weights in the network using the standard backpropagation procedure, we also update the class probability vector p during training. The updating of p depends on the difference between the current p value and the values of the output nodes in the network. For each class i, we first update its input U_i, and then map the updated input to its output V_i (which is proportional to p_i) through the sigmoid function. After each update, p_i gets closer to the output node value. Because the regularities remain in S for a small fraction of mislabeling, the network will gradually become capable of reflecting the regularities after sufficient training. During this process, the correctly labeled set S^(c) plays a more important role than the mislabeled set S^(m) in shaping the weight configuration of the network. The reason is that S^(c) contains more instances than S^(m) and thus, according to the updating rule of backpropagation, it has more opportunities to update the weights. In this way, the weight configuration is gradually changed to reflect the regularities of the data set. Thus if v is a correctly labeled instance, its output vector O = (O_1, O_2, ..., O_c) (where O_i is the value of output node i) will become more consistent with the class probability vector p = (p_1, p_2, ..., p_c) after a certain amount of training. In contrast, if v is a mislabeled instance, O will be less consistent with p, since mislabeled instances do not follow the regularities. However, by updating p so that it becomes closer to O, ADE causes a mislabeled instance to gradually change its class probability vector p and eventually correct the mislabeled class.

The following explains the basic steps of ADE. The weights of the network are initially set randomly with uniform distribution in the range [-0.05, 0.05]. For each instance v = (x, y, p) (where y is the initial class label), its output vector V (proportional to its probability vector p) is initially set as follows: V_y (the output probability for class y) is set to a large fraction D (0.5 < D < 1.0), and (1 - D) is divided equally among the other (C - 1) output components. We tested different D values in the experiment; the results are similar when D is in the range (0.8, 1.0) and drop slowly as D decreases from 0.8 to 0.5. In our experiment we chose D = 0.95. If C = 3 and y = 1, for example, then V_1 = 0.95 and V_2 = V_3 = 0.025. The input U_i is then determined from the corresponding output using the inverse sigmoid function. The initial number of hidden nodes is set to 1.
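As a concrete illustration of Eq. (3) and of this initialization, the sketch below (a hedged reconstruction in Python, not the authors' code) maps between node inputs U and outputs V and builds the initial vectors for an instance with label y; the values u_0 = 0.02 and D = 0.95 are those given in the text, and everything else is an assumption for illustration.

```python
import numpy as np

U0 = 0.02  # amplification parameter u_0 of Eq. (3)

def sigmoid(U):
    """Eq. (3): V = (1/2)(1 + tanh(U / u_0)), with range [0.0, 1.0]."""
    return 0.5 * (1.0 + np.tanh(U / U0))

def inverse_sigmoid(V, eps=1e-9):
    """Inverse of Eq. (3); used to recover the initial inputs U from V."""
    V = np.clip(V, eps, 1.0 - eps)        # keep arctanh finite
    return U0 * np.arctanh(2.0 * V - 1.0)

def init_instance_outputs(y, C, D=0.95):
    """Initial outputs: V_y = D, and the remaining (1 - D) is split
    equally over the other C - 1 components; returns (V, U)."""
    V = np.full(C, (1.0 - D) / (C - 1))
    V[y] = D
    return V, inverse_sigmoid(V)

# The paper's example: C = 3, initial class 1 (0-based index 0 here),
# giving V = [0.95, 0.025, 0.025].
V, U = init_instance_outputs(0, 3)
```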

For each training instance v, the weights in the network and the probability vector p of v are updated using the following procedure:

(i) Update the network weights using the standard backpropagation algorithm, with learning rate L_n and momentum M_n for the net.

(ii) For each class i, update the output class probability using the formula

U_i = U_i + L_p (O_i - V_i)    (4)

where O_i is the value of output node i for instance v, and L_p (different from L_n) is the learning rate for the probability vector. We treat O_i as the target for updating the probability vector. The output V_i is then obtained from the updated input U_i through the sigmoid function (Eq. (3)). After each training epoch, the values V_i for all i (i = 1, 2, ..., C) are normalized so that their sum is 1, and the probability vector p is set equal to V (i.e., p_i = V_i for i = 1, 2, ..., C).

(iii) The class label y of instance v (= (x, y, p)) is also updated, using the formula

y = argmax_i {p_i, i = 1, 2, ..., C}    (5)

that is, y is relabeled to the class with the maximum probability. If v is a mislabeled instance, for example, its class label can be corrected by this mechanism after a certain number of training epochs, which gradually update the class probabilities.

After every N_e epochs, the sum of squared errors (SSE) over all instances in the data set is calculated to monitor the progress of the training. If N_e is too small, more computation is needed; if N_e is too large, the training progress cannot be monitored accurately. We tried different values of N_e and found good performance over a range of values above 5. We chose N_e = 20 in our experiment because it was slightly better than the other choices.

Instead of using the SSE directly, we use an adjusted version, SSE^(adj), calculated using the formula

SSE^(adj) = SSE^(std) + SSE^(hn) + SSE^(dist)    (6)

where SSE^(std) is the standard SSE. SSE^(hn) is an additional term that takes into account the effect of the number of hidden nodes. More hidden nodes can usually lead to a smaller SSE^(std), but with a higher possibility of overfitting. To reduce this effect, we add an error term SSE^(hn) that increases with the number of hidden nodes. We adopt the following empirical formula in ADE:

SSE^(hn) = A_1 (H - 1) N (C - 1)/C                     (H <= I)
SSE^(hn) = (A_1 (I - 1) + A_2 (H - I)) N (C - 1)/C     (H > I)    (7)

where H is the number of hidden nodes, I is the number of input nodes, N is the number of instances in the data set, and C is the number of classes. A_1 and A_2 are two empirical parameters with the constraint A_2 > A_1. We tried different values for A_1 and A_2 and found that performance is good (and similar) when 0.01 < A_1 < 0.1 and 0.1 < A_2 < 0.5. In our experiment, we chose A_1 = 0.05 and A_2 = 0.2. From the formula we see that when H <= I, the error SSE^(hn) is relatively small (A_1 is small); but when H > I, SSE^(hn) increases more rapidly (A_2 >> A_1).

SSE^(dist) is another additional term, which takes into account the deviation of the current class distribution from the initial (original) one. We assume that mislabeling is random in nature: each instance has an equal chance of being mislabeled. Based on this assumption, we can infer that the class distribution of a data set with mislabeling should reflect that of the data set without mislabeling. Thus, if a procedure accurately corrects mislabeled data, the class distribution should be about the same before and after the correcting procedure, and the difference should be very small. That is why we introduce an error term SSE^(dist) that increases with this difference. The class distribution vector q is defined as

q = (q_1, q_2, ..., q_C) = (N_1/N, N_2/N, ..., N_C/N)    (8)

where N is the total number of instances in the data set, and N_i is the number of instances labeled with class i (i = 1, 2, ..., C). Note that an instance u is labeled with class i when i is the class with the maximum magnitude among the C components of its current probability vector p. Let q^(init) = (q_1^(init), q_2^(init), ..., q_C^(init)) and q^(curr) = (q_1^(curr), q_2^(curr), ..., q_C^(curr)) be the initial and current class distribution vectors, respectively. Then SSE^(dist) is calculated using the formula

SSE^(dist) = (N (C - 1)/C) * sum_{i=1}^{C} B_i D_i max(q_i^(curr), q_i^(init))    (9)

where D_i = |q_i^(curr) - q_i^(init)| / q_i^(init) is the difference fraction between q_i^(curr) and q_i^(init), and B_i is an empirical parameter which varies with the range of D_i. We experimented with various settings for B_i, and performance is good through a wide range 0.05 < B_i < 1.5. In our experiment, we set B_i in the following way (since it performed slightly better than other settings): B_i = 0.1 when D_i < 0.05, and B_i = 1.0 when D_i >= 0.05. We can see from Eq. (9) that the error SSE^(dist) increases as the difference between q^(curr) and q^(init) grows; it increases slowly while D_i is small, and more rapidly once D_i surpasses the threshold (0.05).

For a fixed number of hidden nodes H (starting from H = 1 in our experiment), the calculated error SSE^(adj) is compared with the stored best (minimum) previous SSE^(adj) after every N_e (= 20) epochs. If it is smaller, it replaces the previous one as the new best SSE^(adj) and is stored for future comparison and retrieval, along with the current network configuration (weight settings and number of hidden nodes) and the class probability vectors. If no better SSE^(adj) is found after N_m such intervals (equivalent to N_e * N_m = 20 N_m epochs), we assume that the best configuration for the fixed number H of hidden nodes has been found, and we then begin training with H + 1 hidden nodes in an effort to discover an optimal configuration. Different N_m values were evaluated in our experiment. If N_m is too small (< 5), performance drops because the optimal configuration cannot be found; if N_m is too large, computation increases greatly without any performance gain. Performance is about the same as long as N_m >= 10, so we chose N_m = 10 to save computation cost while maintaining performance. If two consecutive additions of hidden nodes do not yield a better result, we assume that the best configuration has been found for the data set (for example, if 4 and 5 hidden nodes do not yield a better result than 3 hidden nodes, we use 3 hidden nodes as the optimal choice). Using the optimal setting, we relabel the data set using the corresponding class probability vectors.
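To make steps (ii)-(iii) and the adjusted error of Eqs. (6)-(9) concrete, here is a minimal sketch in Python. It is an illustrative reconstruction, not the authors' implementation: the backpropagation step (i), the computation of SSE^(std), and the value of the probability learning rate L_p (which the text distinguishes from L_n but does not fix) are assumed to be supplied by the caller.

```python
import numpy as np

U0 = 0.02  # amplification parameter u_0 of Eq. (3)

def sigmoid(U):
    """Eq. (3): V = (1/2)(1 + tanh(U / u_0))."""
    return 0.5 * (1.0 + np.tanh(U / U0))

def update_probability_inputs(U, O, L_p):
    """Step (ii), Eq. (4): U_i <- U_i + L_p (O_i - V_i), treating the
    network outputs O as the target for the probability vector."""
    return U + L_p * (O - sigmoid(U))

def normalize_and_relabel(U):
    """End of epoch: normalize the outputs to sum to 1, set p = V, and
    apply step (iii), Eq. (5): y = argmax_i p_i."""
    V = sigmoid(U)
    p = V / V.sum()
    return p, int(np.argmax(p))

def sse_hn(H, I, N, C, A1=0.05, A2=0.2):
    """Eq. (7): error term penalizing extra hidden nodes H
    (I input nodes, N instances, C classes)."""
    if H <= I:
        return A1 * (H - 1) * N * (C - 1) / C
    return (A1 * (I - 1) + A2 * (H - I)) * N * (C - 1) / C

def sse_dist(q_curr, q_init, N, C):
    """Eq. (9): error term penalizing drift of the current class
    distribution q_curr away from the initial one q_init."""
    D = np.abs(q_curr - q_init) / q_init      # difference fractions D_i
    B = np.where(D < 0.05, 0.1, 1.0)          # empirical B_i schedule
    return N * (C - 1) / C * float(np.sum(B * D * np.maximum(q_curr, q_init)))

def sse_adjusted(sse_std, H, I, q_curr, q_init, N, C):
    """Eq. (6): SSE_adj = SSE_std + SSE_hn + SSE_dist."""
    return sse_std + sse_hn(H, I, N, C) + sse_dist(q_curr, q_init, N, C)
```

In the full procedure, sse_adjusted would be evaluated every N_e = 20 epochs; training with a given H stops after N_m = 10 evaluations without improvement over the stored best value, and the hidden-node search stops when two consecutive increments of H bring no improvement.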
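A sketch of this distribution-preserving mislabeling procedure follows. It is an assumed implementation for illustration (integer labels 0..C-1, and the rounding of βN_i are choices not fixed by the text):

```python
import numpy as np

def mislabel(labels, beta, C, seed=0):
    """Randomly relabel a beta fraction of each class i to the other
    classes, choosing destinations in proportion to the class
    populations q_j of Eq. (8), so the class distribution is kept."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    q = np.array([(labels == i).mean() for i in range(C)])
    for i in range(C):
        members = np.flatnonzero(labels == i)
        n_flip = int(round(beta * len(members)))
        flip = rng.choice(members, size=n_flip, replace=False)
        dest = q.copy()
        dest[i] = 0.0                 # never "relabel" to the same class
        dest /= dest.sum()
        labels[flip] = rng.choice(C, size=n_flip, p=dest)
    return labels
```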

We then run ADE on S_m and output a corrected training set S_c. The performance of ADE is evaluated by comparing the test-set accuracies of two classifiers based on the nearest neighbor rule: NNR_c, built from the corrected training set S_c, and NNR_m, built from the mislabeled set S_m without correction (both using 1-nearest neighbor). Both NNR_c and NNR_m use T as the testing set in each iteration.

The nearest neighbor rule (NNR) [7] works as follows. To classify an input instance v, NNR compares v with all instances in the training set, finds the most similar instance u, and then assigns v the same class as u (this is also called 1-NN; a minimal code sketch is given after Table 1). One variation is to classify v based on the top k most similar instances (k-NN) in the training set using a voting mechanism. The accuracy for one stratified 10-fold cross-validation is the total number of correctly classified instances over all 10 iterations divided by the total number of instances in the data set (|S| + |T|). For each data set, we conduct 20 such stratified 10-fold cross-validations and then average them.

Table 1 shows the size and other properties of the data sets: size is the number of instances; #attr is the number of attributes (not including the class); #num is the number of continuous attributes; #symb is the number of nominal attributes; #class is the number of classes.

Table 1. Description of the 24 UCI data sets: australian, balance, crx, echoc, ecoli, hayes, heartc, hearth, horse, iono, iris, led7, led24, lense, lymph, monk1, monk2, monk3, pima, postop, voting, wave21, wave40 and zoo.
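For completeness, a minimal sketch of the 1-NN rule used in the comparison; Euclidean distance over numeric features is an assumption here, as the paper does not restate its similarity measure:

```python
import numpy as np

def nn_classify(x, train_X, train_y):
    """1-NN: assign the query x the class of its most similar training
    instance (squared Euclidean distance assumed)."""
    d = np.sum((np.asarray(train_X) - np.asarray(x)) ** 2, axis=1)
    return train_y[int(np.argmin(d))]
```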

Figures 1 and 2 show simulation results on the 24 tested data sets. In each graph, the two curves display the test-set accuracies of two nearest neighbor classifiers, one without ADE and the other using ADE to correct mislabeled training data, as a function of the mislabeling level β. Each data point represents the accuracy averaged over 20 stratified 10-fold cross-validations, along with the corresponding error bar at a 95% confidence level. The results show that for most of these data sets, the classifier using ADE performs significantly better than the one without ADE, as long as the mislabeled level stays below a moderate threshold. In this range, the correctly labeled data is dominant and is capable of controlling the formation of the network architecture. During this process, the formed network is able to gradually correct the class probability vectors of the mislabeled data.

Fig. 1. Simulation results on real-world domains (panels: Australian, Balance, Crx, Echoc, Ecoli, Hayes, Heart(C), Heart(H), Horse, Iono, Iris, Led7), comparing the test-set accuracies of nearest neighbor classifiers without ADE and with ADE to correct mislabeled training data.

Fig. 2. Simulation results on real-world domains (panels: Led24, Lense, Lymph, Monk1, Monk2, Monk3, Pima, Postop, Voting, Wave21, Wave40, Zoo), comparing the test-set accuracies of nearest neighbor classifiers without ADE and with ADE to correct mislabeled training data.

One observation is that as the mislabeled level increases, the performance of ADE starts to degrade. The reason is that the dominance of the correctly labeled data becomes weaker as the mislabeled level rises. As the level approaches about 50%, there is no longer any obvious dominance by either the correctly or the incorrectly labeled data. This explains why the performance drops dramatically at that point (though the accuracies using ADE are still higher than those without ADE in some cases). The performance of ADE varies with the data set: it is significantly better than not using ADE for most tested data sets, and still slightly better than or similar to it for the others.

Another observation is that even when the mislabeled level is 0 (i.e., without adding any artificially mislabeled data), the accuracy using ADE is still significantly higher than that without ADE for some data sets (australian, crx, echoc, ecoli, hearth, led7, lense, pima, postop, and wave21). This indicates that these data sets may already contain some noise or mislabeling, and using ADE to correct them allows nearest neighbor classifiers to achieve higher test-set accuracies.

5. Summary

In summary, we have presented an approach, ADE, for correcting mislabeled data. In this approach, a class probability vector is attached to each instance, and its value evolves as training continues. ADE combines the backpropagation network with a relaxation mechanism for training. A learning algorithm is proposed to update the class probability vector based on the difference between its current value and the network output value. The architecture, weight settings and output values of the network are determined predominantly by the correctly labeled instances when the mislabeled percentage in a data set is reasonably small. This mechanism enables class label correction by allowing gradual changes to the class probability vectors of mislabeled instances during training. We have tested the performance of ADE on 24 data sets drawn from the UCI data repository by comparing the accuracies of two versions of nearest neighbor classifiers, one using the training set corrected by ADE and the other using the training set without correction. The results show that for most data sets the classifiers based on the training set corrected by ADE perform significantly better than those without ADE.

References

[1] D.W. Aha and D. Kibler, Noise-tolerant instance-based learning algorithms, in: Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Morgan Kaufmann, Detroit, MI, 1989.
[2] D.W. Aha, D. Kibler and M.K. Albert, Instance-based learning algorithms, Machine Learning 6 (1991), 37-66.
[3] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, Wadsworth International Group, 1984.
[4] C.E. Brodley and M.A. Friedl, Identifying and eliminating mislabeled training instances, in: Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1996.
[5] G.W. Gates, The reduced nearest neighbor rule, IEEE Transactions on Information Theory 18 (1972), 431-433.
[6] B.V. Dasarathy, Nosing around the neighborhood: A new system structure and classification rule for recognition in partially exposed environments, IEEE Transactions on Pattern Analysis and Machine Intelligence 2 (1980), 67-71.
[7] B.V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, CA, 1991.
[8] D. Gamberger, N. Lavrac and S. Dzeroski, Noise elimination in inductive concept learning, in: Proceedings of the 7th International Workshop on Algorithmic Learning Theory, 1996.
[9] P.E. Hart, The condensed nearest neighbor rule, IEEE Transactions on Information Theory 14 (1968), 515-516.
[10] J.J. Hopfield and D.W. Tank, Neural computation of decisions in optimization problems, Biological Cybernetics 52 (1985), 141-152.
[11] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989), 359-366.
[12] G.H. John, Robust decision trees: Removing outliers from databases, in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, AAAI Press, Montreal, Quebec, 1995.
[13] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995.
[14] C.J. Merz and P.M. Murphy, UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[15] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, CA, 1993.
[16] C.M. Teng, Correcting noisy data, in: Proceedings of the 16th International Conference on Machine Learning, 1999.
[17] C.M. Teng, Evaluating noise correction, in: Proceedings of the 6th Pacific Rim International Conference on Artificial Intelligence, Lecture Notes in AI, Springer-Verlag, 2000.
[18] D.R. Wilson and T.R. Martinez, Instance pruning techniques, in: Machine Learning: Proceedings of the Fourteenth International Conference (ICML'97), Morgan Kaufmann, San Francisco, CA, 1997.
[19] D.R. Wilson and T.R. Martinez, Reduction techniques for exemplar-based learning algorithms, Machine Learning 38(3) (2000), 257-286.


More information

Learning-based License Plate Detection on Edge Features

Learning-based License Plate Detection on Edge Features Learnng-based Lcense Plate Detecton on Edge Features Wng Teng Ho, Woo Hen Yap, Yong Haur Tay Computer Vson and Intellgent Systems (CVIS) Group Unverst Tunku Abdul Rahman, Malaysa wngteng_h@yahoo.com, woohen@yahoo.com,

More information

Network Intrusion Detection Based on PSO-SVM

Network Intrusion Detection Based on PSO-SVM TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

A User Selection Method in Advertising System

A User Selection Method in Advertising System Int. J. Communcatons, etwork and System Scences, 2010, 3, 54-58 do:10.4236/jcns.2010.31007 Publshed Onlne January 2010 (http://www.scrp.org/journal/jcns/). A User Selecton Method n Advertsng System Shy

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

Automated Selection of Training Data and Base Models for Data Stream Mining Using Naïve Bayes Ensemble Classification

Automated Selection of Training Data and Base Models for Data Stream Mining Using Naïve Bayes Ensemble Classification Proceedngs of the World Congress on Engneerng 2017 Vol II, July 5-7, 2017, London, U.K. Automated Selecton of Tranng Data and Base Models for Data Stream Mnng Usng Naïve Bayes Ensemble Classfcaton Patrca

More information

Bootstrapping Color Constancy

Bootstrapping Color Constancy Bootstrappng Color Constancy Bran Funt and Vlad C. Carde * Smon Fraser Unversty Vancouver, Canada ABSTRACT Bootstrappng provdes a novel approach to tranng a neural network to estmate the chromatcty of

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Complex System Reliability Evaluation using Support Vector Machine for Incomplete Data-set

Complex System Reliability Evaluation using Support Vector Machine for Incomplete Data-set Internatonal Journal of Performablty Engneerng, Vol. 7, No. 1, January 2010, pp.32-42. RAMS Consultants Prnted n Inda Complex System Relablty Evaluaton usng Support Vector Machne for Incomplete Data-set

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

A Weighted Method to Improve the Centroid-based Classifier

A Weighted Method to Improve the Centroid-based Classifier 016 Internatonal onference on Electrcal Engneerng and utomaton (IEE 016) ISN: 978-1-60595-407-3 Weghted ethod to Improve the entrod-based lassfer huan LIU, Wen-yong WNG *, Guang-hu TU, Nan-nan LIU and

More information

An Improved Image Segmentation Algorithm Based on the Otsu Method

An Improved Image Segmentation Algorithm Based on the Otsu Method 3th ACIS Internatonal Conference on Software Engneerng, Artfcal Intellgence, Networkng arallel/dstrbuted Computng An Improved Image Segmentaton Algorthm Based on the Otsu Method Mengxng Huang, enjao Yu,

More information

Biological Sequence Mining Using Plausible Neural Network and its Application to Exon/intron Boundaries Prediction

Biological Sequence Mining Using Plausible Neural Network and its Application to Exon/intron Boundaries Prediction Bologcal Sequence Mnng Usng Plausble Neural Networ and ts Applcaton to Exon/ntron Boundares Predcton Kuochen L, Dar-en Chang, and Erc Roucha CECS, Unversty of Lousvlle, Lousvlle, KY 40292, USA Yuan Yan

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Face Recognition Based on SVM and 2DPCA

Face Recognition Based on SVM and 2DPCA Vol. 4, o. 3, September, 2011 Face Recognton Based on SVM and 2DPCA Tha Hoang Le, Len Bu Faculty of Informaton Technology, HCMC Unversty of Scence Faculty of Informaton Scences and Engneerng, Unversty

More information

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES Aram AlSuer, Ahmed Al-An and Amr Atya 2 Faculty of Engneerng and Informaton Technology, Unversty of Technology, Sydney, Australa

More information

An Evolvable Clustering Based Algorithm to Learn Distance Function for Supervised Environment

An Evolvable Clustering Based Algorithm to Learn Distance Function for Supervised Environment IJCSI Internatonal Journal of Computer Scence Issues, Vol. 7, Issue 5, September 2010 ISSN (Onlne): 1694-0814 www.ijcsi.org 374 An Evolvable Clusterng Based Algorthm to Learn Dstance Functon for Supervsed

More information

A Simple and Efficient Goal Programming Model for Computing of Fuzzy Linear Regression Parameters with Considering Outliers

A Simple and Efficient Goal Programming Model for Computing of Fuzzy Linear Regression Parameters with Considering Outliers 62626262621 Journal of Uncertan Systems Vol.5, No.1, pp.62-71, 211 Onlne at: www.us.org.u A Smple and Effcent Goal Programmng Model for Computng of Fuzzy Lnear Regresson Parameters wth Consderng Outlers

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

Image Representation & Visualization Basic Imaging Algorithms Shape Representation and Analysis. outline

Image Representation & Visualization Basic Imaging Algorithms Shape Representation and Analysis. outline mage Vsualzaton mage Vsualzaton mage Representaton & Vsualzaton Basc magng Algorthms Shape Representaton and Analyss outlne mage Representaton & Vsualzaton Basc magng Algorthms Shape Representaton and

More information

Optimal Workload-based Weighted Wavelet Synopses

Optimal Workload-based Weighted Wavelet Synopses Optmal Workload-based Weghted Wavelet Synopses Yoss Matas School of Computer Scence Tel Avv Unversty Tel Avv 69978, Israel matas@tau.ac.l Danel Urel School of Computer Scence Tel Avv Unversty Tel Avv 69978,

More information