An efficient weighted nearest neighbour classifier using vertical data representation


Int. J. Business Intelligence and Data Mining, Vol. x, No. x, xxxx

An efficient weighted nearest neighbour classifier using vertical data representation

William Perrizo
Department of Computer Science, North Dakota State University, P.O. Box 5164, Fargo, ND, USA
E-mail: william.perrizo@ndsu.edu

Qin Ding*
Department of Computer Science, Pennsylvania State University at Harrisburg, Middletown, PA 17057, USA
Fax: E-mail: qding@psu.edu
*Corresponding author

Maleq Khan
Department of Computer Science, Purdue University, 250 N. University St., West Lafayette, IN 47907, USA
E-mail: mmkhan@cs.purdue.edu

Anne Denton
Department of Computer Science and Operations Research, North Dakota State University, P.O. Box 5164, Fargo, ND, USA
E-mail: anne.denton@ndsu.edu

Qiang Ding
Jiangsu Telecom Co., Ltd., Huan Cheng Nan Lu #88, Nantong, Jiangsu, China
E-mail: qding74@gmail.com

Abstract: The k-nearest neighbour (KNN) technique is a simple yet effective method for classification. In this paper, we propose an efficient weighted nearest neighbour classification algorithm, called PINE, using vertical data representation. A metric called HOBBit is used as the distance metric. The PINE algorithm applies a Gaussian podium function to set weights to different neighbours. We compare PINE with classical KNN methods using horizontal and vertical representation with different distance metrics. The experimental results show that PINE outperforms other KNN methods in terms of classification accuracy and running time.

Copyright 200x Inderscience Enterprises Ltd.

Keywords: nearest neighbours; classification; data mining; vertical data; spatial data.

Reference to this paper should be made as follows: Perrizo, W., Ding, Q., Khan, M., Denton, A. and Ding, Q. (xxxx) 'An efficient weighted nearest neighbour classifier using vertical data representation', Int. J. Business Intelligence and Data Mining, Vol. x, No. x, pp.xxx-xxx.

Biographical notes: William Perrizo is a Professor of Computer Science at North Dakota State University. He holds a PhD Degree from the University of Minnesota, a Master's Degree from the University of Wisconsin and a Bachelor's Degree from St. John's University. He has been a Research Scientist at the IBM Advanced Business Systems Division and the USA Air Force Electronic Systems Division. His areas of expertise are data mining, knowledge discovery, database systems, distributed database systems, high-speed computer and communications networks, precision agriculture and bioinformatics. He is a member of ISCA, ACM, IEEE, IAAA and AAAS.

Qin Ding is an Assistant Professor in Computer Science at Pennsylvania State University at Harrisburg, USA. She received her PhD in Computer Science from North Dakota State University, USA, and her MS and BS in Computer Science from Nanjing University, China. Her research interests include data mining, databases, bioinformatics and spatial data.

Maleq Khan received his BS in Computer Science and Engineering from Bangladesh University of Engineering and Technology, Bangladesh, and his MS in Computer Science from North Dakota State University, ND, USA. He is currently working toward the PhD Degree in Computer Science at Purdue University, IN, USA. His research interests include distributed algorithms, communication networks, wireless sensor networks, bioinformatics and data mining.

Anne Denton is an Assistant Professor in Computer Science at North Dakota State University, USA. Her research interests are in data mining, knowledge discovery in scientific data and bioinformatics.
Specific interests include data mining of diverse data, in which objects are characterised by a variety of properties such as numerical and categorical attributes, graphs, sequences, time-dependent attributes and others. She received her PhD in Physics from the University of Mainz, Germany and her MS in Computer Science from North Dakota State University, Fargo, ND, USA.

Qiang Ding received his MS and PhD in Computer Science from North Dakota State University, USA, and his BS in Computer Science from Nanjing University, China. He was an Assistant Professor in Computer Science at Concordia College, MN, USA, before he joined Jiangsu Telecom Co., Ltd., China. His research interests include databases and data mining.

1 Introduction

The nearest neighbour technique (Cover and Hart, 1967; Cover, 1968; Dasarathy, 1991; Duda and Hart, 1973; McLachlan, 1992; Safar, 2005) is a simple yet effective method for classification. Given a set of training data, a KNN classifier predicts the class value for a new sample X by searching the training set for the KNNs of X based on a certain distance

metric and then assigning to X the most frequent class occurring in its KNNs. The value of k, i.e., the number of neighbours, is pre-specified. Nearest neighbour classifiers are lazy learners in that a classifier is not built until a new sample needs to be classified. They differ from eager classifiers, such as decision tree induction (Breiman et al., 1984; Quinlan, 1993; Jeong and Lee, 2005; James, 1985; Fu, 2005).

In classical KNN methods, data are represented in horizontal format. Each training observation is a tuple that consists of n features (also called attributes). Each time a new sample needs to be classified, the entire training data set is scanned to find the KNNs of this new sample. The database scan can be costly if the training data set is extremely large.

In this paper, we propose a nearest neighbour algorithm, called Podium Incremental Neighbour Evaluator (PINE), using vertical data representation. Vertical data representation can usually provide a high compression ratio for large datasets. Vertical representation is particularly suitable for spatial data, not only because a spatial data set is typically large, but also because the continuity of a spatial neighbourhood provides potential for an even higher compression ratio using vertical representation.

In most KNN methods, each of the KNNs casts an equal vote for the class of the new sample. In the PINE algorithm, we apply a Gaussian podium function to assign different weights to different neighbours. The podium function is chosen in such a way that the neighbours close to the sample have higher impact than others in determining the class of the sample. In addition, all the samples in the database are considered neighbours, though some samples have very low or even zero weights. By doing so, there is no need to pre-specify the k value and, more importantly, high classification accuracy is achieved.

The idea of using weights in KNN is not new.
It is common to assign weights to different features in the distance metrics, since some features may be more relevant than others in terms of closeness or distance (Fu and Wang, 2005). Weights can also be assigned to the instances. Cost and Salzberg presented a method of attaching weights to different instances for learning with symbolic features (Cost and Salzberg, 1993). Some instances are more reliable and are considered exemplars. These reliable exemplars are given smaller weights to make them appear closer to a new sample. In our work, based on the distances from the new sample, we partition the entire training set into different groups (called 'rings') and then assign a weight to each ring using a podium function. The podium function establishes a riser height for each step of the podium weighting function as the distance from the sample grows. The idea of a podium function is similar to the concept of a radial basis function (Neifeld and Psaltis, 1993; Ali and Smith, 2005).

In this paper, we use a vertical data representation called the P-tree¹ (Perrizo et al., 2001a; Ding et al., 2002). P-trees are a compact and data-mining-ready structure, which provides a lossless representation of the original horizontal data. P-trees represent bit information that is obtained from the data through a separation into bit planes. The multi-level structure is used to achieve high compression. A consistent multi-level structure is maintained across all bit planes of all attributes. This is done so that a simple multi-way logical AND operation can be used to reconstruct count information for any attribute value or tuple.

Different metrics can be defined for the closeness of two data points. In this paper, we use a metric called High Order Basic Bit similarity (HOBBit) (Khan et al., 2002). The HOBBit distance metric is particularly suitable for spatial data.

We run the PINE algorithm on very large real image datasets. Our experimental results show that it is much faster to build a KNN classifier using vertical data representation than using horizontal data representation for very large datasets. They also show that the PINE algorithm achieves higher classification accuracy than other KNN methods.

The rest of the paper is organised as follows. In Section 2, we review the vertical P-tree structure. In Section 3, we introduce the HOBBit metric and detail our PINE algorithm. Performance analysis is given in Section 4. Section 5 concludes the paper.

2 The vertical P-tree structure review

We use a vertical structure, called a Peano Count Tree (P-tree), to represent the data. We first split each attribute value into bits. For example, for image data, each band is an attribute and the value is represented as a byte (8 bits). The file for each individual bit is called a bSQ file. For an attribute with m-bit values, we have m bSQ files. We organise each bSQ bit file, B_ij (the file constructed from the jth bits of the ith attribute), into a P-tree. An example of an 8 x 8 bSQ file and its P-tree is given in Figure 1.

Figure 1 P-tree and PM-tree

In this example, 55 is the count of 1's in the bSQ file, called the root count. The numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits and 0-bits respectively (called a pure1 and a pure0 quadrant respectively), we do not need sub-trees for these two quadrants. This pattern is continued recursively. Recursive raster ordering is called Peano or Z-ordering in the literature, hence the name P-trees. The process will eventually terminate at the leaf level where each quadrant is a 1-row-1-column quadrant.

For each band, assuming 8-bit data values, we get 8 basic P-trees, one for each bit position. For band B_i we will label the basic P-trees P_{i,1}, P_{i,2}, ..., P_{i,8}, where P_{i,j} is a lossless representation of the jth bit of the values from the ith band.
However, the P_{i,j} provide much more information and are structured to facilitate many important data mining processes.

For efficient implementation, we use a variation of P-trees, called the PM-tree (Pure Mask tree), in which masks instead of counts are used. In the PM-tree, 3-value logic is used, i.e., 11 represents a pure1 quadrant, 00 represents a pure0 quadrant and 01 represents a mixed quadrant. To simplify, we use 1 for pure1, 0 for pure0 and m for mixed. This is illustrated in the third part of Figure 1.
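As an illustration, building a PM-tree from one bit plane can be sketched as follows. This is a simplified model, not the authors' implementation: quadrants are recursed in the order upper-left, upper-right, lower-left, lower-right, and the function names are ours.

```python
# Minimal PM-tree sketch over one bit plane (bSQ file) of a square image.
# A node is '1' (pure1 quadrant), '0' (pure0 quadrant), or ('m', children)
# for a mixed quadrant with four sub-quadrants.

def build_pm_tree(bits):
    """bits: square list-of-lists of 0/1 values (side a power of two)."""
    n = len(bits)
    flat = [b for row in bits for b in row]
    if all(b == 1 for b in flat):
        return '1'
    if all(b == 0 for b in flat):
        return '0'
    h = n // 2
    quads = [[row[:h] for row in bits[:h]],   # upper-left
             [row[h:] for row in bits[:h]],   # upper-right
             [row[:h] for row in bits[h:]],   # lower-left
             [row[h:] for row in bits[h:]]]   # lower-right
    return ('m', [build_pm_tree(q) for q in quads])

def root_count(tree, size):
    """1-bit count represented by a (sub)tree covering size x size cells."""
    if tree == '1':
        return size * size
    if tree == '0':
        return 0
    return sum(root_count(child, size // 2) for child in tree[1])
```

For an 8 x 8 plane whose upper-left quadrant is all 1-bits, the root is mixed, the first child is pure1, the others are pure0, and the root count is 16.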

P-tree algebra contains the operators AND, OR, NOT and XOR, which are the pixel-by-pixel logical operations on P-trees (Ding et al., 2002). The NOT operation is a straightforward translation of each count to its quadrant-complement. The AND and OR operations are shown in Figure 2.

Figure 2 P-tree algebra (AND and OR)

The basic P-trees can be combined using simple logical operations to produce P-trees for the original values (at any level of precision: 1-bit precision, 2-bit precision, etc.). We let P_{b,v} denote the P-tree for attribute b and value v, where v can be expressed in any bit precision. Using 8-bit precision for values, P_{b,v} can be constructed from the basic P-trees as:

P_{b,v} = P*_{b,1} AND P*_{b,2} AND ... AND P*_{b,8}

where P*_{b,j} = P_{b,j} if the jth bit of v is 1, and P*_{b,j} = P'_{b,j} otherwise; ' indicates the NOT operation. The AND operation is simply the pixel-wise AND of the bits.

Similarly, data in the relational format can also be represented as P-trees. For any combination of values (v_1, v_2, ..., v_n), where v_i is from attribute i, the quadrant-wise count of occurrences of this combination of values is given by:

P_{(v_1, v_2, ..., v_n)} = P_{1,v_1} AND P_{2,v_2} AND ... AND P_{n,v_n}

P-trees are implemented in an efficient way, which facilitates fast P-tree operations and considerable saving of space. More details can be found in Ding et al. (2002) and Perrizo et al. (2001b).
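The value P-tree construction above amounts to an AND over complemented or uncomplemented bit planes. A sketch of that logic on uncompressed bit planes (Python integers standing in for P-trees, one bit per tuple; names are ours, not the paper's API):

```python
# Value "P-tree" on flat bit planes: AND the plane for each bit position,
# complementing (NOT) the planes where the target value has a 0-bit.

def value_ptree(planes, v, n_tuples):
    """planes[j]: int whose bit t is the jth bit (j = 0 is the most
    significant) of tuple t's 8-bit attribute value. Returns the bit
    vector of tuples whose attribute value equals v exactly."""
    mask = (1 << n_tuples) - 1
    result = mask
    for j in range(8):
        bit = (v >> (7 - j)) & 1
        plane = planes[j] if bit == 1 else (~planes[j] & mask)  # NOT for 0-bits
        result &= plane                                          # P-tree AND
    return result

def root_count(bitvec):
    """Number of qualifying tuples (the root count of the result)."""
    return bin(bitvec).count('1')
```

For example, with attribute values [105, 105, 96, 255] over four tuples, the root count of the value P-tree for 105 is 2.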

3 Distance-weighted nearest neighbour classification using vertical data representation

In classical KNN classification techniques, there is a limit placed on the number of neighbours. In our distance-weighted neighbour classification approach, the podium or distance weighting function establishes a riser height for each step of the podium weighting function as the distance from the sample grows. This approach gives users maximum flexibility in choosing just the right level of influence for each training sample in the entire training set. The real question is: can this level of flexibility be offered without imposing a severe penalty with respect to the speed of the classifier? Traditionally, sub-sampling, neighbour-limiting and other restrictions are introduced precisely to ensure that the algorithm will finish its classification in reasonable time. The use of the compressed, data-mining-ready data structure, the P-tree, in fact makes PINE even faster than traditional methods. This is critically important in classification, since data are typically never discarded and therefore the training set will grow without bound. The classification technique must scale well or it will quickly become unusable in this setting. PINE scales well, since its accuracy increases as the training set grows while its speed remains very reasonable (see the performance study below). Furthermore, since PINE is lazy (it does not require a training phase in which a closed-form classifier is pre-built), it does not incur the expensive delays required for rebuilding a classifier when new training data arrive. Thus, PINE gives us a faster and more accurate classifier.

Before explaining how the distance-weighted neighbour classifier PINE using P-trees works, we give an overview of the technique of KNN classification and illustrate how to calculate KNN using P-trees without considering weights. In KNN, the basic idea is that the tuples that most likely belong to the same class are those that are similar in the other attributes. This continuity assumption is consistent with the properties of a spatial neighbourhood.
Based on some pre-selected distance metric or similarity measure, such as Euclidean distance, classical KNN finds the k most similar or nearest training samples to an unclassified sample and assigns the plurality class of those k samples to the new sample (Dasarathy, 1991; Neifeld and Psaltis, 1993). The value for k is pre-selected by the user based on the accuracy required (usually, the larger the value of k, the more accurate the classifier) and the delay time acceptable for classifying with that k-value (usually, the larger the value of k, the slower the classifier). The steps of the classification process are:

- determine a suitable distance metric
- find the KNNs using the selected distance metric
- find the plurality class of the KNNs (voting on the class labels of the KNNs)
- assign that class to the sample to be classified.

We use a distance metric called HOBBit, which provides an efficient method of computation based on P-trees. Instead of examining individual training samples to find the nearest neighbours, we start our initial neighbourhood of the target sample within a specified distance in the feature space based on this metric, and then successively expand the neighbourhood area until there are at least k tuples in the neighbourhood set.

Of course, there may be more boundary neighbours equidistant from the sample than are necessary to complete the KNN set, in which case one can either use the larger set or arbitrarily ignore some of them. Traditional methods find the exact KNN set, since that is easiest using traditional techniques, although it is clear that allowing some samples at that distance to vote and not others will skew the result. Instead, the P-tree-based KNN approach builds a closed nearest neighbour set (closed-KNN) (Khan et al., 2002), that is, we include all of the boundary neighbours. The inductive definition of the closed-KNN set is given below:

a If x is in KNN, then x is in closed-KNN.
b If x is in closed-KNN and d(T, y) <= d(T, x), then y is in closed-KNN, where d(T, x) is the distance of x from the target T.
c Closed-KNN does not contain any tuple which cannot be produced by steps (a) and (b).

Experimental results show that closed-KNN yields higher classification accuracy than KNN does. If there are many tuples on the boundary, inclusion of some but not all of them skews the voting mechanism. The P-tree implementation requires no extra computation to find the closed-KNN. Our neighbourhood expansion mechanism automatically includes the entire boundary of the neighbourhood. P-tree algorithms avoid the examination of individual data points, which improves the classification efficiency.

3.1 Higher Order Basic bit (HOBBit) distance

Using a suitable distance or similarity metric is very important in KNN methods because it can strongly affect performance. There is no metric that works best for all cases. Research has been done to find the right metric for the right problem (Short and Fukunaga, 1980, 1981; Friedman, 1994; Myles and Hand, 1990; Hastie and Tibshirani, 1996; Domeniconi et al., 2002). The most commonly used distance metric is the Euclidean metric.
For two data points, X = <x_1, x_2, x_3, ..., x_{n-1}> and Y = <y_1, y_2, y_3, ..., y_{n-1}>, the Euclidean similarity function is defined as

\[ d_2(X, Y) = \sqrt{\sum_{i=1}^{n-1} (x_i - y_i)^2}. \]

It can be generalised to the Minkowski similarity function,

\[ d_q(X, Y) = \left( \sum_{i=1}^{n-1} w_i \, |x_i - y_i|^q \right)^{1/q}. \]

If q = 2, this gives the Euclidean function. If q = 1, it gives the Manhattan distance, which is

\[ d_1(X, Y) = \sum_{i=1}^{n-1} |x_i - y_i|. \]

If q = infinity, it gives the max function

\[ d_\infty(X, Y) = \max_{i=1}^{n-1} |x_i - y_i|. \]

In this paper we use the HOBBit distance metric (Khan et al., 2002), which is a natural choice for spatial data. The HOBBit metric measures distance based on the most significant consecutive bit positions starting from the left (the highest order bit). Similarity, or closeness, is of interest. When comparing two values bitwise from left to right, once a difference is found, any further comparisons are not needed. The HOBBit similarity between two integers A and B is defined by

\[ S_H(A, B) = \max \{ s \mid 0 \le i \le s \Rightarrow a_i = b_i \} \tag{1} \]

where a_i and b_i are the ith bits of A and B respectively, counting from the left. The HOBBit distance between two tuples X and Y is defined by

\[ d_H(X, Y) = \max_{i=1}^{n-1} \{ m - S_H(x_i, y_i) \} \tag{2} \]

where m is the number of bits in the binary representations of the values; n - 1 is the number of attributes used for measuring distance (the nth being the class attribute); and x_i and y_i are the ith attributes of tuples X and Y. The HOBBit distance between two tuples is a function of the least similar pairing of attribute values in them. From our experiments, we found that the HOBBit distance is more suitable for spatial data than other distance metrics.

3.2 Closed-KNN

To find the closed-KNN set, first we look for the tuples which are identical to the target tuple in all 8 bits of all bands, i.e., the tuples X having distance from the target T, d_H(X, T) = 0. If, for instance, t_1 = 105 (01101001_b = 105_d) is a target attribute value, the initial interval of interest is [105, 105] ([01101001, 01101001]). If over all tuples the number of matches is less than k, we compare attributes on the basis of the seven most significant bits, not caring about the 8th bit. The expanded interval of interest would be [104, 105] ([01101000, 01101001]). If k matches still have not been found, removing one more bit from the right gives the interval [104, 107] ([01101000, 01101011]). Continuing to remove bits from the right, we get the intervals [104, 111], then [96, 111] and so on. This process is implemented using P-trees as follows.
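Before turning to the P-tree implementation, the HOBBit definitions in equations (1) and (2) and the neighbourhood expansion just described can be sketched on plain Python integers. This is an illustration on flat arrays, not the P-tree version; function names are ours.

```python
M = 8  # bits per attribute value

def hobbit_similarity(a, b):
    """Equation (1): number of matching bits from the most significant
    position down, stopping at the first disagreement."""
    s = 0
    for i in range(M - 1, -1, -1):      # high-order bit first
        if ((a >> i) & 1) != ((b >> i) & 1):
            break
        s += 1
    return s

def hobbit_distance(x, y):
    """Equation (2): the distance is driven by the least similar
    pairing of attribute values in the two tuples."""
    return max(M - hobbit_similarity(a, b) for a, b in zip(x, y))

def closed_knn(train, target, k):
    """Neighbourhood expansion: require a match in the j highest order
    bits of every attribute, lowering j from M until at least k tuples
    qualify. Returns (j, indices of the closed-KNN set)."""
    for j in range(M, -1, -1):
        members = [idx for idx, t in enumerate(train)
                   if all((a >> (M - j)) == (b >> (M - j))
                          for a, b in zip(t, target))]
        if len(members) >= k:
            return j, members
    return 0, list(range(len(train)))
```

For the target value 105, lowering j from 8 to 7 widens the interval of interest from [105, 105] to [104, 105], exactly as in the worked example above.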
P_{i,j} is the basic P-tree for bit j of band i, and P'_{i,j} is the complement of P_{i,j}. Let b_{i,j} be the jth bit of the ith band of the target tuple, and for implementation purposes let the representation of the P-tree depend on the value of b_{i,j}. Define:

Pt_{i,j} = P_{i,j} if b_{i,j} = 1,
        = P'_{i,j} otherwise.

Then the root count of Pt_{i,j} is the number of tuples in the training dataset having the same value as the jth bit of the ith band of the target tuple. Define:

\[ Pv_{i,1\text{-}j} = Pt_{i,1} \,\&\, Pt_{i,2} \,\&\, Pt_{i,3} \,\&\, \cdots \,\&\, Pt_{i,j} \tag{3} \]

where & is the P-tree AND operator and n is the number of bands. Pv_{i,1-j} counts the tuples having the same bit values as the target tuple in the higher order j bits of the ith band. Then a neighbourhood P-tree can be formed as follows:

\[ Pnn(j) = Pv_{1,1\text{-}j} \,\&\, Pv_{2,1\text{-}j} \,\&\, Pv_{3,1\text{-}j} \,\&\, \cdots \,\&\, Pv_{n-1,1\text{-}j} \tag{4} \]

We calculate the initial neighbourhood P-tree, Pnn(8), matching exactly in all bands, considering 8-bit values. Then we calculate Pnn(7), matching in the seven higher order bits; then Pnn(6) and so on. We continue as long as the root count of Pnn(j) is less than k. Let us denote the final Pnn(j) by Pcnn. Pcnn represents the closed-KNN set, and the root count of Pcnn is the number of the nearest tuples. A bit of one in Pcnn for a tuple means that the tuple is in the closed-KNN set.

For the purpose of classification, we do not need to consider all bits in the class band. If the class band is 8 bits long, there are 256 possible classes. Instead, we partition the class band values into fewer groups, say 8, by truncating the 5 least significant bits. The 8 classes are 0, 1, 2, ..., 7. Using the leftmost 3 bits we construct the value P-trees P_n(0), P_n(1), ..., P_n(7). The P-tree Pcnn & P_n(i) represents the tuples having class value i that are in the closed-KNN set, Pcnn. A class i which yields the maximum root count of Pcnn & P_n(i) is the plurality class; that is,

\[ \text{predicted class} = \arg\max_i \{ RC(Pcnn \,\&\, P_n(i)) \} \tag{5} \]

where RC(P) is the root count of P.

3.3 PINE: weighted nearest neighbour classifier using vertical data representation

The continuity assumption of KNN tells us that tuples that are more similar to a given tuple have more influence on classification than tuples that are less similar. Therefore, giving more voting weight to closer tuples than to distant tuples increases the classification accuracy.
Instead of considering only the KNNs, we include all of the points, using the largest weight, 1, for those matching exactly, and the smallest weight, 0, for those furthest away. Figure 3 shows the neighbourhood rings using the HOBBit metric. Many weighting functions which decrease with distance can be used (e.g., Gaussian, Kriging, etc.). A simple example of such a function is a linear podium function (shown in Figure 4), which decreases step-by-step with distance.

Note that the HOBBit distance metric is ideally suited to the definition of neighbourhood rings, because the range of points that are considered equidistant grows exponentially with distance from the centre. Adjusting weights is particularly important for small to intermediate distances, where the podiums are small. At larger distances, where fine-tuning is less important, the HOBBit distance remains unchanged over a large range, i.e., the podiums are wider. Ideally, the 0-weighted ring should include all training samples that are judged to be too far away (by a domain expert) to influence class.

Figure 3 Neighbourhood rings using the HOBBit metric

Figure 4 An example of a linear podium function

We number the rings from 0 (outermost) to m (innermost). Let w_j be the weight associated with ring j. Let c_ij be the number of neighbour tuples in ring j belonging to class i. Then the total weighted vote for class i is given by:

\[ V(i) = \sum_{j=0}^{m} w_j c_{ij} \tag{6} \]

This can easily be transformed to:

\[ V(i) = w_0 \sum_{k=0}^{m} c_{ik} + \sum_{j=1}^{m} \left( (w_j - w_{j-1}) \sum_{k=j}^{m} c_{ik} \right) \tag{7} \]
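Equations (6) and (7) can be checked with a small sketch (our own variable names): the ring form and the telescoped circle form give the same vote, and the circle form only needs the cumulative counts that P-trees deliver directly.

```python
# Podium vote for one class i: rings are numbered 0 (outermost) to m
# (innermost), w[j] is the weight of ring j, c_i[j] the number of
# class-i tuples in ring j.

def vote_rings(c_i, w):
    """Equation (6): V(i) = sum_j w_j * c_ij."""
    return sum(wj * cj for wj, cj in zip(w, c_i))

def vote_circles(c_i, w):
    """Equation (7): the same vote written over cumulative 'circle'
    counts sum_{k=j}^{m} c_ik (circle j = ring j plus all inner rings)."""
    m = len(w) - 1
    circle = [sum(c_i[j:]) for j in range(m + 1)]
    return w[0] * circle[0] + sum((w[j] - w[j - 1]) * circle[j]
                                  for j in range(1, m + 1))
```

For example, with ring counts [6, 3, 1] and weights [0, 0.5, 1], both forms give a vote of 2.5.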

Let circle j be the circle formed by the rings j, j + 1, ..., m, that is, ring j including all of its inner rings. Referring to equation (4), the P-tree Pnn(j) represents all of the tuples in circle j. Therefore, {Pnn(j) & P_n(i)} represents the tuples in circle j and class i, where P_n(i) is the P-tree for class i. Hence:

\[ \sum_{k=j}^{m} c_{ik} = RC\{ Pnn(j) \,\&\, P_n(i) \} \tag{8} \]

\[ V(i) = w_0 \, RC\{ Pnn(0) \,\&\, P_n(i) \} + \sum_{j=1}^{m} \left[ (w_j - w_{j-1}) \, RC\{ Pnn(j) \,\&\, P_n(i) \} \right] \tag{9} \]

A class i which yields the maximum weighted vote, V(i), is the plurality class or the predicted class; that is:

\[ \text{predicted class} = \arg\max_i \{ V(i) \} \tag{10} \]

4 Performance analysis

We have performed experiments to evaluate PINE on real data sets. An example data set is given in Figure 5. This data set includes an aerial TIFF image (with Red, Green and Blue bands) and associated ground data, such as nitrate, moisture and yield, of the Oaks area in North Dakota. In this data set, yield is the class label attribute, i.e., the goal is to predict the yield based on the values of the other attributes. This data set and some other data sets are available at

Figure 5 An image dataset: (a) TIFF image; (b) nitrate map; (c) moisture map and (d) yield map

We tested KNN with the Manhattan, Euclidean, Max and HOBBit distance metrics; closed-KNN with the HOBBit metric; and PINE. In PINE, HOBBit was used as the distance function and a Gaussian function was used as the podium function. The function is exp(-2^{2d} / (2 sigma^2)), where d is the HOBBit distance and the variance sigma^2 is 2^4. The mapping of the function is given in Table 1.

Table 1 Gaussian weights as a function of HOBBit distance (columns: HOBBit distance, Gaussian weight)

The comparison of the accuracy is given in Figure 6. We observed that PINE achieves higher classification accuracy than closed-KNN does. Especially when the training set size increases, the improvement of PINE over closed-KNN is more apparent. In addition, PINE performs much better than classical KNN methods using various metrics. All these classifiers work better than raw guessing, which is 12.5% in this data set with eight classes.

Figure 6 Comparison of classification accuracy for KNN using different metrics, closed-KNN and PINE

In terms of running time, from Figure 7, we observed that both closed-KNN and PINE are much faster than classical KNN methods using various metrics (notice that both the size and the classification time are plotted in logarithmic scale). The reason is that both PINE and closed-KNN use vertical data representation, which provides fast calculation of nearest neighbours. We can also see that PINE is slightly slower than
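The Gaussian podium weights can be sketched as follows. Note that the garbled source leaves the variance parameter ambiguous (sigma vs. sigma squared); we read it as sigma^2 = 2^4, which should be checked against the original Table 1.

```python
import math

def gaussian_podium_weight(d, sigma2=2 ** 4):
    """Weight of the ring at HOBBit distance d (d = 0 is an exact match).
    sigma2 = 2**4 is our reading of the paper's variance parameter."""
    return math.exp(-(2 ** (2 * d)) / (2 * sigma2))
```

Because the argument grows as 2^{2d}, the weight falls off very sharply: rings at small HOBBit distances keep weights near 1, while distant rings contribute essentially nothing, matching the intent that far-away samples carry zero weight.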

closed-KNN. This is expected because in PINE all the training samples are included as neighbours, while in closed-KNN only k samples (with possibly extra points on the boundary) are selected as neighbours to be involved in the calculation. However, this additional time cost is relatively small. On average, PINE is eight times faster than classical KNN, and closed-KNN is ten times faster. Both PINE and closed-KNN increase at a lower rate than classical KNN methods do when the training set size increases.

Figure 7 Comparison of classification time per sample (size and classification time are plotted in logarithmic scale)

5 Conclusions

In this paper, we propose a weighted nearest neighbour classifier, called PINE. PINE uses a vertical data structure, the P-tree, and a distance metric called HOBBit. A Gaussian podium function is used to weight the different neighbours based on their distance from the sample data. PINE does not require providing the number of neighbours in advance. PINE is particularly useful for classification on large data sets, such as spatial data sets, which have a high compression ratio using vertical data representation. Performance analysis shows that PINE outperforms classical KNN methods in terms of accuracy and speed of classification on spatial data.

In addition, PINE has potential for efficient classification on data streams. In data streams, new data keep arriving, so classification efficiency is an important issue. PINE provides high efficiency as well as accuracy for classification on data streams. As the areas of data mining applications grow (Chen and Liu, 2005), our approach has the potential to be extended to some of these areas, such as DNA microarray data analysis and medical image analysis.

Acknowledgement

This work is partially supported by GSA Grant K.

References

Ali, S. and Smith, K.A. (2005) 'Kernel width selection for SVM classification: a meta-learning approach', International Journal of Data Warehousing and Mining, Idea Group Inc., Vol. 1, No. 4, pp.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees, Wadsworth, Belmont.
Chen, S.Y. and Liu, X. (2005) 'Data mining from 1994 to 2004: an application-orientated review', International Journal of Business Intelligence and Data Mining, Inderscience Publishers, Vol. 1, No. 1, pp.
Cost, S. and Salzberg, S. (1993) 'A weighted nearest neighbour algorithm for learning with symbolic features', Machine Learning, Vol. 10, No. 1, pp.
Cover, T.M. (1968) 'Rates of convergence for nearest neighbour procedures', Proceedings of Hawaii International Conference on Systems Sciences, Honolulu, Hawaii, USA, pp.
Cover, T.M. and Hart, P. (1967) 'Nearest neighbour pattern classification', IEEE Trans. on Information Theory, pp.
Dasarathy, B.V. (1991) Nearest-Neighbour Classification Techniques, IEEE Computer Society Press, Los Alamitos, CA.
Ding, Q., Khan, M., Roy, A. and Perrizo, W. (2002) 'The P-tree algebra', Proceedings of ACM Symposium on Applied Computing, pp.
Domeniconi, C., Peng, J. and Gunopulos, D. (2002) 'Locally adaptive metric nearest neighbour classification', IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 24, No. 9, pp.1281-1285.
Duda, R.O. and Hart, P.E. (1973) Pattern Classification and Scene Analysis, Wiley, New York.
Friedman, J. (1994) Flexible Metric Nearest Neighbour Classification, Technical report, Stanford University, Stanford, CA, USA.
Fu, L. (2005) 'Novel efficient classifiers based on data cube', International Journal of Data Warehousing and Mining, Idea Group Inc., Vol. 1, No. 3, pp.
Fu, X. and Wang, L.
(2005) 'Data dimensionality reduction with application to improving classification performance and explaining concepts of data sets', International Journal of Business Intelligence and Data Mining, Inderscience Publishers, Vol. 1, No. 1, pp.
Hastie, T. and Tibshirani, R. (1996) 'Discriminant adaptive nearest neighbour classification', IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 18, No. 6, pp.
James, M. (1985) Classification Algorithms, John Wiley & Sons, New York.
Jeong, M. and Lee, D. (2005) 'Improving classification accuracy of decision trees for different abstraction levels of data', International Journal of Data Warehousing and Mining, Idea Group Inc., Vol. 1, No. 3, pp.
Khan, M., Ding, Q. and Perrizo, W. (2002) 'k-nearest neighbour classification on spatial data streams using P-trees', Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Artificial Intelligence, Vol. 2336, pp.
McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York.
Myles, J.P. and Hand, D.J. (1990) 'The multi-class metric problem in nearest neighbour discrimination rules', Pattern Recognition, Vol. 23, pp.1291-1297.

Neifeld, M.A. and Psaltis, D. (1993) 'Optical implementations of radial basis classifiers', Applied Optics, Vol. 32, No. 8, pp.
Perrizo, W., Ding, Q., Ding, Q. and Roy, A. (2001a) 'On mining satellite and other remotely sensed images', Proceedings of ACM Workshop on Research Issues on Data Mining and Knowledge Discovery, Santa Barbara, CA, USA, pp.
Perrizo, W., Ding, Q., Ding, Q. and Roy, A. (2001b) 'Deriving high confidence rules from spatial data using Peano count trees', Lecture Notes in Computer Science, Vol. 2118, pp.
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann.
Safar, M. (2005) 'K nearest neighbour search in navigation systems', Mobile Information Systems, IOS Press, Vol. 1, No. 3, pp.
Short, R. and Fukunaga, K. (1980) 'A new nearest neighbour distance measure', Proceedings of International Conference on Pattern Recognition, Miami Beach, FL, USA, pp.
Short, R. and Fukunaga, K. (1981) 'The optimal distance measure for nearest neighbour classification', IEEE Trans. on Information Theory, Vol. 27, pp.

Note

1 P-tree technology is patented by North Dakota State University (William Perrizo, primary inventor of record); patent number 6,941,303, issued September 6. Website: TIFF image data sets, available at


More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Fast Feature Value Searching for Face Detection

Fast Feature Value Searching for Face Detection Vol., No. 2 Computer and Informaton Scence Fast Feature Value Searchng for Face Detecton Yunyang Yan Department of Computer Engneerng Huayn Insttute of Technology Hua an 22300, Chna E-mal: areyyyke@63.com

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

A Background Subtraction for a Vision-based User Interface *

A Background Subtraction for a Vision-based User Interface * A Background Subtracton for a Vson-based User Interface * Dongpyo Hong and Woontack Woo KJIST U-VR Lab. {dhon wwoo}@kjst.ac.kr Abstract In ths paper, we propose a robust and effcent background subtracton

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Random Kernel Perceptron on ATTiny2313 Microcontroller

Random Kernel Perceptron on ATTiny2313 Microcontroller Random Kernel Perceptron on ATTny233 Mcrocontroller Nemanja Djurc Department of Computer and Informaton Scences, Temple Unversty Phladelpha, PA 922, USA nemanja.djurc@temple.edu Slobodan Vucetc Department

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

CSE 326: Data Structures Quicksort Comparison Sorting Bound

CSE 326: Data Structures Quicksort Comparison Sorting Bound CSE 326: Data Structures Qucksort Comparson Sortng Bound Steve Setz Wnter 2009 Qucksort Qucksort uses a dvde and conquer strategy, but does not requre the O(N) extra space that MergeSort does. Here s the

More information

A Similarity Measure Method for Symbolization Time Series

A Similarity Measure Method for Symbolization Time Series Research Journal of Appled Scences, Engneerng and Technology 5(5): 1726-1730, 2013 ISSN: 2040-7459; e-issn: 2040-7467 Maxwell Scentfc Organzaton, 2013 Submtted: July 27, 2012 Accepted: September 03, 2012

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY Proceedngs of the 20 Internatonal Conference on Machne Learnng and Cybernetcs, Guln, 0-3 July, 20 THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY JUN-HAI ZHAI, NA LI, MENG-YAO

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2860-2866 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A selectve ensemble classfcaton method on mcroarray

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

A Lazy Ensemble Learning Method to Classification

A Lazy Ensemble Learning Method to Classification IJCSI Internatonal Journal of Computer Scence Issues, Vol. 7, Issue 5, September 2010 ISSN (Onlne): 1694-0814 344 A Lazy Ensemble Learnng Method to Classfcaton Haleh Homayoun 1, Sattar Hashem 2 and Al

More information

Pruning Training Corpus to Speedup Text Classification 1

Pruning Training Corpus to Speedup Text Classification 1 Prunng Tranng Corpus to Speedup Text Classfcaton Jhong Guan and Shugeng Zhou School of Computer Scence, Wuhan Unversty, Wuhan, 430079, Chna hguan@wtusm.edu.cn State Key Lab of Software Engneerng, Wuhan

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE Dorna Purcaru Faculty of Automaton, Computers and Electroncs Unersty of Craoa 13 Al. I. Cuza Street, Craoa RO-1100 ROMANIA E-mal: dpurcaru@electroncs.uc.ro

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

Shape Representation Robust to the Sketching Order Using Distance Map and Direction Histogram

Shape Representation Robust to the Sketching Order Using Distance Map and Direction Histogram Shape Representaton Robust to the Sketchng Order Usng Dstance Map and Drecton Hstogram Department of Computer Scence Yonse Unversty Kwon Yun CONTENTS Revew Topc Proposed Method System Overvew Sketch Normalzaton

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Face Recognition Method Based on Within-class Clustering SVM

Face Recognition Method Based on Within-class Clustering SVM Face Recognton Method Based on Wthn-class Clusterng SVM Yan Wu, Xao Yao and Yng Xa Department of Computer Scence and Engneerng Tong Unversty Shangha, Chna Abstract - A face recognton method based on Wthn-class

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

USING GRAPHING SKILLS
