A classification scheme for applications with ambiguous data


Thomas P. Trappenberg
Centre for Cognitive Neuroscience, Department of Psychology, University of Oxford, Oxford OX1 3UD, England
Thomas.Trappenberg@psy.ox.ac.uk

Andrew D. Back
Katestone Scientific, 64 MacGregor Tce, Bardon, QLD 4065, Australia
back@usa.net

Abstract: We propose a scheme for pattern classification in applications which include ambiguous data, that is, where patterns occupy overlapping areas in the feature space. Such situations frequently occur with noisy data and/or where some features are unknown. We demonstrate that it is advantageous to first detect those ambiguous areas with the help of training data and then to re-classify the data in these areas as ambiguous before making class predictions on test sets. The scheme is demonstrated with a simple example and benchmarked on two real-world applications.

Keywords: data classification, ambiguous data, probabilistic ANN, k-NN algorithm.

1. Introduction

Adaptive data classification is a core issue in data mining, pattern recognition, and forecasting. Many algorithms have been developed, including classical methods such as linear discriminant analysis and Bayesian classifiers, more recent statistical techniques such as k-NN (k-nearest neighbors) and MARS (multivariate adaptive regression splines), machine learning approaches for decision trees, including C4.5, CART, C5, and Bayes trees, and neural network approaches such as multilayer perceptrons and neural trees [1-8].

Most of these classification schemes work well enough when the classes are separable. However, in many real-world problems the data may not be separable; that is, there may exist regions in the feature space that are occupied by more than one class. In many problems this ambiguity in the data is unavoidable. A similar problem occurs when the data are very closely spaced and a highly nonlinear decision boundary is required to separate them. Accordingly, the aim of much recent work on classification has been to find better nonlinear classifiers. Particularly notable in this area is the field of support vector machines [9, 10]. SVMs have attracted much interest, as they are able to find nonlinear classification boundaries while minimizing the empirical risk of false classification.

However, it is important to consider what the data really mean and what the practical, real-world goals are. In many cases it is desirable to find a simple classifier which gives the user a rough but understandable guide to what the data mean. Moreover, the data themselves may be contaminated by noise, some input variables may be missing entirely (i.e., the data then lie in a feature space of implicitly too low a dimension), or the data may be flawed in other ways. This issue is commonly discussed in the context of robust statistics and outlier detection.

In this paper we propose a method for preprocessing ambiguous data. The basic idea is quite straightforward: rather than seek a complex classifier, we first examine the data with the aim of removing any ambiguities. Once the ambiguous data are removed, we apply whatever classifier is required, hopefully one which leads to a much simpler solution than would otherwise be obtained.

In doing this we acknowledge that data in some regions of the state space either cannot be classified at all, or our confidence in doing so is low. Hence those data points are labeled in a different manner to facilitate better classification. Our proposed scheme is to identify ambiguous data and to re-classify those data with an additional class. We call this additional class IDK ("I don't know") to indicate that predicting the class of these data should not be attempted due to their ambiguity. By doing so one loses some predictions; however, we will show that one gains in return a drastic increase in the confidence of the classification of the remaining data.

Our proposed scheme is outlined in the next section, where we also introduce the particular implementation used in the examples discussed in this paper. The synthetic example in section 3 illustrates the underlying problem and the proposed solution in more detail. In section 4 we report on the performance of our algorithms on two real-world (benchmark) data sets, the classical Iris benchmark [11] and a medical data set [12].

2. Detection of Ambiguous Data

As stressed in the introduction, our scheme is to detect ambiguous state-space areas and to re-classify data in these areas. Hence, our proposed scheme has the following steps:

1. Test all data points against a criterion of ambiguity.
2. Re-classify training data which are ambiguous.
3. Classify test data with an algorithm trained on the re-classified data.

Note that the scheme is independent of any particular classification algorithm. In practice it might be critical to choose appropriate algorithms for each of these steps. As a means of illustrating the proposed method, we use the particular algorithms described below.

For the first step we employ a k-NN algorithm [3]. This algorithm takes the k data points closest to the data point in question into account to decide on the new class of that point. If an overwhelming majority of the neighboring data is of one particular class, then this class is taken to be the class of the data point. If no overwhelming majority can be reached, the data point is declared ambiguous and classified as a member of the class IDK.

The next step requires a classification method for the predictive classification of further data (test data). While any type of adaptive classifier could be chosen, in the following tests we use a feedforward neural network,

    Hidden layer: a_j^{(1)} = \sum_k w_{jk}^{(1)} x_k + \theta_j^{(1)}, \quad h_j = f(a_j^{(1)})
    Output layer: a_i^{(2)} = \sum_j w_{ij}^{(2)} h_j + \theta_i^{(2)}, \quad y_i = f(a_i^{(2)})

with a softmax output layer defined by the normalized transfer function

    y_i = \exp(a_i^{(2)}) / \sum_k \exp(a_k^{(2)}),

so that the outputs can be treated as probabilities, or confidence values, that the input data belong to the class indicated by the particular output node. This network is trained on the negative cross entropy

    E = -\sum_\mu \sum_i t_i^\mu \ln y_i(x^\mu; w),

which is appropriate to give the network outputs the probabilistic interpretation [13]. Here t^\mu is the target vector of training example \mu, with component t_i^\mu = 1 if the example belongs to class i. In the examples below we use the Levenberg-Marquardt (LM) algorithm [14] to train the network on the training data set.
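To make this concrete, the following is a minimal Python/numpy sketch of the two ingredients just described. It is an illustration rather than the original implementation: encoding the IDK class as the label -1 is a choice made here, the hidden transfer function f is assumed to be tanh (the text does not specify it), and the LM training loop is omitted.

```python
import numpy as np

IDK = -1  # encoding chosen here for the "I don't know" class

def knn_relabel(X, y, k=10, majority=0.8):
    """Steps 1 and 2: flag ambiguous points and re-classify them as IDK.

    Each point is compared with its k nearest neighbors (the point itself
    included). If at least a fraction `majority` of them share one class,
    the point keeps that class; otherwise it is re-labeled as IDK.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; adequate for the small sets used here.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    y_new = np.empty_like(y)
    for i, row in enumerate(d):
        neigh = y[np.argsort(row)[:k]]        # self is included (distance 0)
        labels, counts = np.unique(neigh, return_counts=True)
        best = counts.argmax()
        y_new[i] = labels[best] if counts[best] >= majority * k else IDK
    return y_new

def mlp_forward(X, W1, b1, W2, b2):
    """Hidden layer h = f(W1 x + theta1) with f = tanh (assumed),
    followed by a softmax output layer, so each row of Y sums to 1."""
    H = np.tanh(X @ W1 + b1)
    A = H @ W2 + b2
    A -= A.max(axis=1, keepdims=True)         # numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def cross_entropy(Y, T):
    """E = -sum_mu sum_i t_i^mu ln y_i(x^mu; w), averaged over examples;
    T holds the one-hot target vectors t^mu."""
    return -np.mean(np.sum(T * np.log(Y + 1e-12), axis=1))
```

Including the point itself in its own neighborhood, as the example in section 3 does, means an isolated point of a minority class still counts toward its own majority.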

Figure 1: Example with overlapping data. The left column shows results of a standard classification procedure, whereas the right column shows results of the proposed re-classification scheme. (a) The raw input data (training set) with data from class a (circles) and class b (squares). (b) Re-classified training set including ambiguous data in class IDK (triangles). (c) Classification of the original training data after training with a probabilistic MLP; false classifications are marked with solid symbols. (d) Classification of the re-classified training set. (e, f) Performance on a test set. (g, h) Probability surface of class a generated by the two networks.

3. Partially Overlapping Classes: An Example

Here we illustrate the proposed scheme using two overlapping classes in a two-dimensional state space. We define two classes a and b whose features x_1 and x_2 are drawn from uniform distributions with overlapping areas:

    class a: x_1 \in [0, 1],     x_2 \in [0, 1]
    class b: x_1 \in [0.6, 1.6], x_2 \in [0.6, 1.6]
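Such a training set is straightforward to generate; a sketch under the conventions of the listing above (the even 50/50 split of the 100 points is an assumption, as the text states only the total):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 50                                     # assumed split of 100 points

Xa = rng.uniform(0.0, 1.0, size=(n_per_class, 2))    # class a: [0,1] x [0,1]
Xb = rng.uniform(0.6, 1.6, size=(n_per_class, 2))    # class b: [0.6,1.6] x [0.6,1.6]
X = np.vstack([Xa, Xb])
y = np.array([0] * n_per_class + [1] * n_per_class)
# The classes overlap in the square [0.6,1] x [0.6,1], where labels are ambiguous.
```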

An example of 100 data points from these two classes is shown in Figure 1(a), with data from class a as circles and data from class b as squares. For comparison we trained a classification network directly on these training data. The networks always had 10 hidden nodes and were trained with 100 LM training steps. The classifications of the training data with this network after training are shown in Figure 1(c). Only 4 data points were not classified correctly; the network even learned to classify most of the training data in the ambiguous area.

The re-classification of these training data with the k-NN algorithm described above is shown in Figure 1(b). Data with components 0.6 < x_1, x_2 < 1 are ambiguous. K = 10 nearest neighbors (including the data point itself) were taken into account when choosing the new class structure for this data set. The class of a data point was set to the majority class if 80% or more of the neighboring data (including the data point itself) belonged to that majority. If such a majority could not be reached, the data were classified as class IDK, symbolized by open triangles in the figures.

These re-classified data were used as training data for a second classification network, similar to the previous one; we only added one output node to account for the additional class IDK. The classification of the re-classified training data with this network after training is shown in Figure 1(d). Only one data point was not correctly classified.

More important than the performance of the classification network on the training data is its performance on test data. Examples are shown in Figures 1(e) and 1(f) for the two classification networks, respectively. The network that was trained with ambiguous data (Figure 1(e)) falsely classified 1/3 of the test data, corresponding to a standard performance value of P' := n_c/n = 0.67, where n_c is the number of correct classifications and n the total number of data points. As might be expected, there are numerous false classifications in the area with ambiguous data. What is even more disturbing, however, is that there are many false classifications of data far away from this area. This can also be seen clearly from the prediction surface of this network, illustrated with gray values in Figure 1(g). White corresponds to a confidence (the value of the first output node) of 1 in predicting class a, whereas black corresponds to a confidence of 0 in predicting class a (which in this example corresponds to a confidence of 1 in predicting class b). The attempt of the network to find a classification scheme in the area with ambiguous data clearly led it to propose structure that does not correspond to the underlying problem. This structure is extrapolated to areas without ambiguous data, leading to the poor performance on data in these areas of the input space.

The situation is much better with the re-classified data. The results of classifying the same test data with the network trained on the re-classified data are shown in Figure 1(f). Only five patterns were falsely classified, all of which are close to the boundaries of the area with ambiguous data. This corresponds to a standard performance of P' = 0.95 (compared to 0.67) when taking all classes into account. The improvement comes largely from the fact that the underlying problem no longer contains ambiguous data, so that perfect classification can be expected in the limit of infinite data. This is contrary to the original problem, which includes ambiguous data, so that a perfect classification cannot be achieved even in the limit of infinite training data.

Indeed, in the example shown in Figure 1(f), no data were classified wrongly when taking only predictions of classes a and b into account. This will not always be the case, but it will be true in the infinite-data limit. Moreover, the areas far away from the ambiguous area can be predicted with high confidence, and the false classifications have much lower confidence values.
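In outline, the comparison of this section can be reproduced with the sketch below. It reuses X, y, knn_relabel, and IDK from the earlier listings and substitutes scikit-learn's MLPClassifier (10 hidden nodes and a softmax/log-loss output, but a different optimizer than LM) for the network described in section 2, so the exact numbers will differ from those reported here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# Fresh test set drawn from the same distributions as the training data.
Xt = np.vstack([rng.uniform(0.0, 1.0, (50, 2)),
                rng.uniform(0.6, 1.6, (50, 2))])
yt = np.array([0] * 50 + [1] * 50)

def fit(labels):
    # 10 hidden nodes as in the text; the optimizer is a substitution for LM.
    return MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                         random_state=0).fit(X, labels)

plain = fit(y)                        # trained on the raw, ambiguous labels
relab = fit(knn_relabel(X, y))        # trained on the re-classified labels

for name, clf in (("plain", plain), ("re-classified", relab)):
    pred = clf.predict(Xt)
    decided = pred != IDK             # IDK outputs count as abstentions
    print(f"{name}: P'(all) = {np.mean(pred == yt):.2f}, "
          f"P'(a/b only) = {np.mean(pred[decided] == yt[decided]):.2f}, "
          f"IDK rate = {1.0 - decided.mean():.2f}")
```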
4. Real-World Data: Some Benchmark Examples

The previous example was intended to describe our scheme and to illustrate why it should be useful. However, only data taken from real-world examples can tell whether the scheme is useful in practice. Hence, in the following we report an initial study of the application of this scheme to some real-world data. The following examples are all taken from the UCI repository of machine learning databases [15], which can be accessed via the Internet.

4.1. Iris Dataset

We first tested our scheme on the classical Iris benchmark. The dataset contains 150 examples with 4 physical properties of 3 members of the family of iris flowers. The dataset was first used by Fisher in 1936 [11] to illustrate multivariate discriminant analysis techniques in taxonomic problems. We divided the dataset evenly into a training set and a test set by taking 25 examples from each class into each subset. The training dataset was re-classified with the same procedure as in the example of section 3, without adjusting any parameters. This re-classification run classified 8 examples of the training data set as class IDK. The class labels of all members of the first class were preserved, which shows that the training set of this class did not include ambiguous data and was easily separable from the other classes. This finding is in accordance with similar findings of Duch et al. [16].

We used a network similar to that of the previous example for the classification task itself; only the number of output nodes was adjusted to represent the required number of classes. After 100 training steps the network was able to represent all data in the training sets, for the original data as well as for the re-classified data. The network trained on the original data made 4% false classifications (3 examples) on the test set. In contrast, no false classifications were made by the network trained on the re-classified data when taking only classifications of iris types into account. The price to pay was that 11 examples of the test set were labeled as IDK.
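This protocol maps onto the same sketch form as before (knn_relabel and IDK from the listing in section 2; scikit-learn's load_iris and MLPClassifier as stand-ins for the data files and the LM-trained network; the random choice of the 25 training examples per class is an assumption, as the text does not state how the split was made):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# 25 examples of each class into the training set, the rest into the test set.
train, test = [], []
for c in np.unique(y):
    idx = rng.permutation(np.flatnonzero(y == c))
    train.extend(idx[:25]); test.extend(idx[25:])

Xtr, ytr = X[train], y[train]
Xte, yte = X[test], y[test]

ytr_re = knn_relabel(Xtr, ytr)   # unchanged parameters: k=10, 80% majority

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                    random_state=0).fit(Xtr, ytr_re)
pred = clf.predict(Xte)
decided = pred != IDK
print("false classifications among iris-type predictions:",
      int(np.sum(pred[decided] != yte[decided])))
print("test examples labeled IDK:", int(np.sum(~decided)))
```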

4.2. Wisconsin Breast Cancer Data

The second test was made on medical data compiled by Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison. A smaller database of these data was initially studied in [12]. The version of the dataset we used contains data from 699 patients, with 9 predictive attributes used in breast cancer diagnosis. The data are classified in two classes: benign (458 instances, 65.5%) and malignant (241 instances, 34.5%). Data from 16 patients were incomplete; we ignored these records in the following test. However, it should be stressed that incomplete information should lead to more ambiguous data and should therefore favor our approach. The effect of incomplete data will be discussed in more detail elsewhere.

We again used the same k-NN re-classification algorithm as in the previous examples, without adjusting any parameters. The data were randomly divided into 340 training data and 343 test data. 14 data points were classified as IDK by the k-NN re-classification algorithm. All data points of both training sets, the original and the re-classified, were classified correctly after training the networks. However, the network trained on the original data classified 23 instances (6.7%) of the test data incorrectly, whereas the second network made only 6 mistakes (1.75%). 22 instances of the test data were classified as IDK.

5. Conclusion and Outlook

In this paper we have proposed a scheme to solve classification problems with ambiguous data. We showed that ambiguous data can lead to poor classification results, not only for data in the ambiguous areas but also for data in areas which should have much better predictability. We showed that the identification of ambiguous input areas and the use of re-classified data for training the classification algorithms can lead to a drastic reduction of false predictive classifications. Hence one should consider abstaining from predictions for data in areas which are highly ambiguous. We think this approach is particularly suitable when predictions have to be made with particular caution.

There are many issues within this scheme that have to be addressed in the future. In particular, we used only a simple k-NN re-classification scheme to identify ambiguous areas in the input space. We neither studied systematically the dependence on the parameters of this algorithm, nor did we explore which algorithms might be best suited for a particular problem. There is also a variety of classification algorithms available, each of which might have advantages for particular applications.
Our network can be improved with a Bayesian regularization scheme, which can also provide additional information on the complexity of the underlying problem. Some work in this direction is in progress. Other advanced classification methods, such as SVMs, can also be used.

Acknowledgment: We would like to thank Wlodek Duch for the discussions of his results on rule extraction during his visit in Japan.

References

[1] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.
[2] Buntine, W.L., Learning classification trees, Statistics and Computing 2, 63-73, 1992.
[3] Cover, T.M., Hart, P.E., Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13, 21-27, 1967.
[4] Duda, R.O., Hart, P.E., Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[5] Hanson, R., Stutz, J., Cheeseman, P., Bayesian classification with correlation and inheritance, Proceedings of the 12th International Joint Conference on Artificial Intelligence 2, Sydney, Australia, Morgan Kaufmann, 692-698, 1991.
[6] Michie, D., Spiegelhalter, D.J., Taylor, C.C. (editors), Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1994.
[7] Richard, M.D., Lippmann, R.P., Neural network classifiers estimate Bayesian a-posteriori probabilities, Neural Computation 3, 461-483, 1991.
[8] Tsoi, A.C., Pearson, R.A., Comparison of three classification techniques, CART, C4.5, and multilayer perceptrons, in Advances in Neural Information Processing Systems 3, 963-969, 1991.
[9] Vapnik, V., The Nature of Statistical Learning Theory, Springer Verlag, New York, 1995.
[10] Vapnik, V., Golowich, S., Smola, A., Support vector method for function approximation, regression estimation, and signal processing, in Advances in Neural Information Processing Systems 9, 1997.
[11] Fisher, R., The use of multiple measurements in taxonomic problems, Annals of Eugenics 7, 179-188, 1936.
[12] Mangasarian, O.L., Wolberg, W.H., Cancer diagnosis via linear programming, SIAM News 23(5), 1-18, September 1990.
[13] Amari, S., Backpropagation and stochastic gradient descent methods, Neurocomputing 5, 185-196, 1993.
[14] Hagan, M.T., Menhaj, M., Training feedforward networks with the Marquardt algorithm, IEEE Transactions on Neural Networks 5(6), 989-993, 1994.
[15] Mertz, C.J., Murphy, P.M., UCI repository of machine learning databases, http://www.ics.uci.edu/pub/machine-learning-databases
[16] Duch, W., Adamczak, R., Grabczewski, K., Zal, G., Hybrid neural-global minimization method of logical rule extraction, Int. Journal of Advanced Computational Intelligence, 1999 (in print).