Robust Subspace Outlier Detection in High Dimensional Space


Zhana

Abstract—Rare data in a large-scale database are called outliers; they reveal significant information in the real world. Subspace-based outlier detection is regarded as a feasible approach in very high dimensional space. However, the outliers found in subspaces are in fact only part of the true outliers in high dimensional space. The outliers hidden among normally clustered points are sometimes neglected in the projected dimensional subspaces. In this paper, we propose a robust subspace method for detecting such inner outliers in a given dataset, which uses two dimensional-projections: detecting outliers in subspaces with the local density ratio in the first projected dimensions, and finding outliers by comparing neighbors' positions in the second projected dimensions. Each point's weight is calculated by summing up all the related values obtained in the two projection steps, and the points scoring the largest weight values are taken as outliers. Through a series of experiments with the number of dimensions ranging from 10 to 10000, the results show that our proposed method achieves high precision in extremely high dimensional space and works well in low dimensional space.

Keywords—outlier detection; high dimensional subspace; dimension projection; k-NS

I. INTRODUCTION

Finding rare and valuable data is always a significant issue in the data mining field. These worthy data are called anomalous data: data that differ from the rest of the normal data under some measure. They are also called outliers, being located far in distance from the others. Outlier detection has many practical applications in different domains, such as medicine development, fraud detection, sports statistics analysis, public health management, and so on. From different perspectives, many definitions of outliers have been proposed. The widely accepted definition is Hawkins': "an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [7].
This definition not only describes the difference of the data from observation but also points out the essential difference of the data in mechanism; some synthetic data are even generated according to this concept in order to verify outlier detection methods. Although outlier detection itself has no special requirement for high dimensional space, large-scale data are more common in the real world. There are two issues for outlier detection in high dimensional space: the first is to overcome the complexity of high dimensional space, and the other is to meet the requirements of real applications given the tremendous growth of high dimensional data. In low dimensional space, outliers can be regarded as points far from the normal points based on distance. In high dimensional space, however, distance no longer gives an exact separation between outliers and normal data.

Figure 1. Sample data plotted in three-dimensional space (a) and in the two-dimensional subspaces X-Y (b), X-Z (c) and Y-Z (d). Four red outliers are clearly separated in (a), but in (b), (c) and (d) only two red outliers are observed; the other two outliers are hidden in the normal clusters.

In this case, outlier detection falls into two categories, distance-based and subspace-based methods. The first category uses robust distance or density in high dimensional space, e.g. LOF [1], HilOut [8], LOCI [3], Grid [4], ABOD [4], etc. These methods are suitable for outlier detection in moderately high dimensional space; in very high dimensional space, however, they perform poorly because of the curse of dimensionality. The second category, subspace-based detection, is an effective way to find outliers in high dimensional space. It is based on the assumption that the outliers found in all low projected dimensional subspaces can be taken as the real outliers in high dimensional space. This line of work includes Aggarwal's fraction method [2], GLS-SOD [6], CURIO [5], SPOT [22], Grid-Clustering [23], etc.
Since outliers are easily found in low projected dimensions by using optimized search algorithms to find suitable cell-grids (divisions of a subspace), this approach is widely used for outlier detection in high dimensional space. Recent advances in geo-spatial analysis, bioinformatics, genetics and particle physics also demand more robust subspace detection methods for growing high dimensional data. However, one key issue is still open: is it true that the outliers detected in subspaces are all of the outliers in high dimensional space?

In fact, subspace-based detection methods can find those outliers that differ from the normal points in the projected dimensional spaces, but they ignore the outliers hidden inside the regions of normal data. These inner outliers are still different from the normal data in high dimensional space. We show a simple example of the difference between these two types of outliers in three-dimensional space and its projected two-dimensional subspaces, as shown in Fig. 1. In total, 124 points are distributed in a three-dimensional space, including 120 normal points in six clusters and 4 outliers in red. In (a), all four outliers can be distinguished because they do not belong to any normal cluster. The outliers O3 and O4 are detected in any of the projected dimensional spaces, while the inner outliers O1 and O2 are hidden inside the clusters in the projected dimensional spaces; therefore, detecting O1 and O2 fails. All subspace-based methods fail to detect these inner outliers, as shown in (b), (c) and (d). Consequently, how to find all outliers with a subspace-based method is still an open issue.

In this paper, we address this issue by utilizing two dimensional-projections and propose a robust subspace detection method called k-NS (k-Nearest Sections). It calculates the ldr (local density ratio) in the first projected dimensional subspaces and the nearest neighbors' ldr in the second projected dimensional subspaces. Then each point's weight is summed statistically, and the outliers are those scoring the largest weights. The main features and contributions of this paper are summarized as follows:

- We apply two dimensional-projections to calculate the weight values in all projected dimensions. For each point, we supply m + m(m-1) weight values in order to compare it with the others extensively.

- Our proposed method employs k-NS (k-Nearest Sections), based on the k-NN (k-Nearest Neighbor) concept, for the local density calculation in the second projected dimensional space.
The inner outliers are detected successfully by evaluating the neighbors' ldr after projecting them into other dimensions.

- We execute a series of experiments with dimensions ranging from 10 to 10000 to evaluate our proposed algorithm. The experimental results show that our proposed algorithm has advantages over other algorithms in stability and precision on high dimensional datasets.

- We also consider the difference between outliers and noisy data. Outliers are clearly distinguishable from noisy data in high dimensional space, while the two are mixed together in low dimensional space.

This paper is organized as follows. In Section 2, we give a brief overview of related work on high dimensional outlier detection. In Section 3, we introduce our concept and approach, and we describe our algorithm. In Section 4, we evaluate the proposed method by experiments on datasets of different dimensionality, both artificially generated and real. Finally, we conclude our findings in Section 5.

II. RELATED WORKS

As an important part of data mining, outlier detection has been developed for more than ten years, and many results have been achieved for large-scale databases. We categorize them into the following five groups.

Distance and Density Based Outlier Detection: distance-based outlier detection is the conventional method because it comes from the original outlier definition, i.e. outliers are those points that are far from other points under distance measures, e.g. HilOut [8]. This algorithm detects a point with its k-nearest neighbors by distance and uses a space-filling curve to map the high dimensional space. The well-known LOF [1] uses a k-NN, density-based algorithm, which detects outliers locally by their k-nearest distance neighbors and measures them by lrd (local reachability density) and LOF (Local Outlier Factor). This algorithm runs smoothly in low dimensional space and is still effective in relatively high dimensional space. LOCI [3] is an improved algorithm based on LOF which is more sensitive to local distance; however, LOCI does not perform as well as LOF in high dimensional space.
Subspace Clustering Based Outlier Detection: since it is difficult to find outliers in high dimensional space directly, these methods try to find points behaving abnormally in low dimensional space. Subspace clustering is a feasible method for outlier detection in high dimensional space. This approach assumes that outliers, if they are different in high dimensional space, always deviate from the others in some low dimensional space. Aggarwal [2] uses equi-depth ranges in each dimension; with an expected fraction f^k of the points in a k-dimensional cube D, the expected count and its deviation are N·f^k and sqrt(N·f^k·(1 − f^k)). This method detects outliers by calculating the sparsity coefficient S(D) of the cube D.

Outlier Detection with Dimension Reduction: another approach is dimension reduction from high dimensional space to low dimensional space, such as SOM (Self-Organizing Map) [8,9], which maps several dimensions to two dimensions and then detects the outliers in the two-dimensional space. FindOut [11] detects outliers by removing clusters and reduces dimensions with a wavelet transform on multi-dimensional data. However, this approach may cause information loss when the dimensionality is reduced; the result is not as robust as expected, and it is seldom applied to outlier detection.

Information-theory Based Outlier Detection: in a subspace, the distribution of points in each dimension can be coded for data compression. Hence, the high dimensional issue is turned into an information statistics issue in each dimension. Christian Böhm has proposed the CoCo [9] method with MDL (Minimum Description Length) for outlier detection, and he also applies this idea to the clustering problem, e.g. robust information-theoretic clustering [5,2].

Other Outlier Detection Methods: besides the above four groups, some other detection measures are also distinctive and useful. One notable approach is ABOD (Angle-Based Outlier Detection) [4]. It is based on the concept of angles, computed with vector and scalar products; outliers usually have smaller angles than normal points.
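As a side note, the expected count and deviation quoted above for Aggarwal's method yield the sparsity coefficient directly. A minimal Python sketch (the function name is chosen here for illustration; it is not from the paper):

```python
import math

def sparsity_coefficient(n_d, n_total, f, k):
    """Sparsity coefficient S(D) of a k-dimensional cube D, as described
    for Aggarwal's method: (observed - expected) / standard deviation.

    n_d     -- observed number of points in cube D
    n_total -- total number of points N
    f       -- expected fraction of points in one equi-depth range per dimension
    k       -- number of dimensions defining the cube
    """
    expected = n_total * f**k                          # N * f^k
    std_dev = math.sqrt(n_total * f**k * (1 - f**k))   # sqrt(N f^k (1 - f^k))
    return (n_d - expected) / std_dev

# A cube holding far fewer points than expected scores strongly negative,
# e.g. N = 10000, f = 0.1, k = 3 gives an expected count of 10.
print(sparsity_coefficient(1, 10000, 0.1, 3))
```

Cubes with strongly negative S(D) are the sparse cells whose points are reported as outliers.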

The above methods have reduced the high dimensional curse to some extent, and they obtain correct results in some special cases. However, the problem still exists and affects detection accuracy. Christian Böhm's information-theory based method is similar to the subspace clustering methods and suffers the same problems as the subspace-based outlier detection methods. In summary, seeking a general approach, or improving the existing subspace-based methods, to detect outliers in high dimensional space is still a key issue that needs to be solved.

III. PROPOSED METHOD

It is known from the last section that not all outliers can be found in the projected dimensional subspaces. The outliers failing to be detected in subspaces are called inner outliers. Inner outliers are mixed into normal clusters in the projected dimensional subspaces, but they are detected as anomalous in high dimensional space. From another point of view, an inner outlier belongs to several normal clusters in different subspaces, but it does not belong to any cluster as a whole. In this paper, the key mission is to find such inner outliers in high dimensional space.

A. General Idea

Learning from the subspace detection methods, we know that the high dimensional issue can be transformed into a statistical issue by looping the detection over all projected dimensional subspaces. Moreover, the points' distribution is independent across different dimensions. By observing these points and learning from the existing outlier definitions, we have found that an outlier may be placed in a cluster of normal points in a certain dimension and deviate in other dimensions. In other words, outliers are clustered with different normal points in different dimensions, while normal points are always clustered together. Therefore, our proposed method needs to solve two sub-issues: how to find outliers effectively in all projected dimensional subspaces, and how to detect the deviation of points of the same region of one dimension when these points are projected to other dimensions.

Our proposal can be divided into four steps. First, we divide the entire range of the data into many small regions in each dimension; we call such a small region a section. Based on the section division, we construct a new data structure called the section space. Second, we calculate the sparsity of each point's section in each dimension by computing the ldr against the average value in that dimension. Third, we calculate the scattering of the points of a same section by ldr after projecting them from the original dimension to the other dimensions. Last, we sum up all the results as a weight for each point and compare all the points by this score. The outliers are the points scoring the largest weight values.

Figure 2. Section Space Division and Dimension Projection

B. Section Data Structure

Our proposed method is based on the section data structure. How this structure is composed, and how the Euclidean data space is transformed into our proposed section space, is introduced below. We divide the space into the same number of equi-width sections in each dimension, so the space looks like a cell-grid. Conventional data space information is composed of points and dimensions, while our proposed data structure represents the data distribution by point, dimension and section. This structure has two advantages. First, the section a point falls in is easily found in all dimensions, so we can use all the related sections' calculated results to denote the point's weight value. Second, it is easy to calculate the distribution change by checking the points' section positions while projecting them to different dimensions. The data structures PointInfo (point information) and SectionInfo (section information) used in our proposal are as follows:

PointInfo[Dimension ID, Point ID]: section ID of the point
SectionInfo[Dimension ID, Section ID]: number of points in the section

PointInfo records each point's section position in the different dimensions. SectionInfo records the number of points of each section in the different dimensions. The main calculations of point sparsity and of the dimension projections are processed on these two data structures.
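A minimal Python sketch of these two structures (the paper's own implementation is in pseudo-R; `build_section_space` is a name chosen here, and the 0.1% border extension described later in this section is applied when forming the equi-width sections):

```python
def build_section_space(data, scn):
    """Build PointInfo and SectionInfo as described above.

    data -- list of n points, each a list of m coordinates
    scn  -- number of equi-width sections per dimension

    point_info[i][j]   -> section ID of point j in dimension i
    section_info[i][s] -> number of points in section s of dimension i
    """
    n, m = len(data), len(data[0])
    point_info = [[0] * n for _ in range(m)]
    section_info = [[0] * scn for _ in range(m)]
    for i in range(m):
        lo = min(p[i] for p in data)
        hi = max(p[i] for p in data)
        # enlarge the range by 0.1% so the two end-sections do not get
        # an artificially larger density; half the extension on each side
        length = (hi - lo) * 1.001
        lo -= (hi - lo) * 0.0005
        width = length / scn                 # equi-width section length
        for j, p in enumerate(data):
            sec = min(int((p[i] - lo) / width), scn - 1)
            point_info[i][j] = sec
            section_info[i][sec] += 1
    return point_info, section_info
```

On the Fig. 2 example, an x range of (5, 23) with scn = 5 gives the section width 18.018 / 5 = 3.6036 discussed below.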
The transformation from the original data space to the proposed section-based space is explained using a two-dimensional example dataset, as shown in Fig. 2. The dataset includes 23 points in two-dimensional space, as shown in Fig. 2(a). The original data distribution in the Euclidean data space is shown in Fig. 2(b). In our proposed section-based structure, we construct the PointInfo structure as in Fig. 2(c) and the SectionInfo structure as in Fig. 2(d). The range of each dimension is divided into five sections in this example; the section division is shown in Fig. 2(b) with blue lines. The data range of each dimension may differ. If we set the same data range for every dimension, covering the maximum over all dimensions, it would produce too many empty sections in some dimensions. The empty sections, producing meaningless values of 0, would markedly affect the results of the following calculations. Therefore, we set the minimal data range in each dimension covering only the area where points exist. In order to avoid the two end-sections

having a larger density than the other sections, we extend the border by enlarging the original range by 0.1%. Taking the data in Fig. 2 as an example of how the data range is generated in each dimension: the original data range in the x dimension is (5, 23), with length 18. The extended range is (4.991, 23.009), obtained by enlarging the length by 0.1%, so the new length is 18.018. The original data range in the y dimension is (6, 25), with length 19; the new data range is (5.9905, 25.0095), with new length 19.019. The length of a section is therefore 3.6036 in the x dimension and 3.8038 in the y dimension.

C. Definitions

Some definitions of the notations used in our proposal are given in Table 1.

Table 1. Definition of Notations

P (point): the information of a point. p_j refers to the j-th point among all points; p_i,j refers to the j-th point in the i-th dimension.
Section: the range of data in each dimension is divided into the same number of equi-width parts, which are called sections.
scn (number of sections): the number of sections per dimension, decided by the total number of points and the average section density. scn is equal in every dimension.
d (section density): the number of points in one section.
dists (section distance): the section distance used for evaluating the section difference among points in all projected dimensions, as defined in (1).
ldr: local density ratio; after introducing sections, it is replaced by sdr.
sdr: section density ratio; its calculation is defined in (4) and (5).
SI (statistic information): the statistic information of each point, composed of all its weights, as defined in (6).

The section density d with different subscripts takes specific meanings in the following cases. Case 1: within a section, all points of that section in a dimension have the same section density; d_i,j means the section density for the points of the j-th section in the i-th dimension. Case 2: the section density is compared with the average density of its dimension, so a low section density means a low ratio against the average section density in that dimension.
d̄_i means the average section density in the i-th dimension. Case 3: if the section density of a particular point is needed, the expression includes the point: d_i(p) means the section density of point p in the i-th dimension.

In the section-based subspace, a section denotes a point's local area, so the local density is replaced by d, and the ldr is replaced by the sdr (section density ratio). Our proposal performs two dimensional-projections. Projecting the points to each one-dimensional subspace is the first projection, and all points are checked in all of these projected dimensional subspaces. After that, the points in the projected dimensions still need to be checked between different subspaces in order to detect inner outliers; therefore, the points are projected again from the first projected dimension to the other dimensions, and their distribution changes are compared with each other. This is called the second dimension projection. The whole procedure thus projects the points twice: from high dimension to one dimension, and from one dimension to the other dimensions.

D. k-Nearest Sections

In this section, we describe the detection method in two steps. In the first step, the sdr is employed to evaluate the sparsity of points in the first projection dimensions. In the second step, which is the key part of this proposal, the scattering of points after their second projections to the other projected dimensions is calculated based on k-NS (k-Nearest Sections). Finally, we summarize the results of the two steps statistically. Before introducing the concept of k-NS, the dists needs to be clarified.

DEFINITION 1 (dists of points) Let points p, q ∈ Section_i be in the i-th dimension. When p and q are projected from dimension i to j, the section distance between them corresponds to the difference of their section IDs:

dists(p, q) = |SecId(p_j) − SecId(q_j)| + 1    (1)

Definition (1) is used to measure the points' scatter in the second projections. In the dimension before the second projection, the points p and q are assumed to be in the same section.
After applying the second projection from dimension i to j, the points p and q may be located in different sections with different section IDs, so we can compare the distance of the two points by the subtraction of SecId(p_j) and SecId(q_j), as in (1). dists(p, q) is defined as the absolute difference of the two points' section IDs, plus 1 in order to avoid the computational complications of a 0 value. In the k-NS algorithm, dists supplies an effective factor to evaluate the scatter of the points in the second projected dimensions. An outlier in k-NS is determined by a statistical weight value, which is decided by its related calculated results in all projected dimensions.

DEFINITION 2 (Outlier in k-NS) The x_ns of a given point x ∈ Section_i in the database D ⊆ R^m is defined as follows:

x_ns = { x | x, x' ∈ D; x, x' ∈ Section_i; p ∈ Section_i :
Σ_{i=1..m} d_i(x) << Σ_{i=1..m} d̄_i, or
Σ_{i=1..m} Σ_{j=1..m, j≠i} dists(x_j, p_j) >> Σ_{i=1..m} Σ_{j=1..m, j≠i} dists(x'_j, p_j) }    (2)

Here x, x' and p are points of the same section in dimension i, and p is any of the neighbor points under the dists measure after applying the second projection. x_ns is a statistical result summarizing all the values of x in the two

dimensional-projections, which means the x_ns of x can be used as a final result to detect outliers. By the k-NS definition, outliers satisfying either of the following two conditions are detected: first, outliers that can be detected in the first projection; second, outliers that can still be detected by the dists-based k-NS in the second projection even if they do not appear abnormal in the first projection. Although the x_ns in (2) can reflect the outlier result, it is difficult to calculate for each point. Therefore, the general statistic information for each point is defined in (3):

DEFINITION 3 (General Statistical Information of a Point) Let sdr_Proj_i(p_i,k) be the calculated value of p_k in the first projected dimension i, and sdr_Proj_ij(p_j,k) the calculated value of p_k after the second projection from dimension i to dimension j. ω1 and ω2 are the weight parameters of these two values. Then the statistical information value of p_k is expressed as follows:

SI_ns(p_k) = ω1 Σ_{i=1..m} sdr_Proj_i(p_i,k) + ω2 Σ_{i=1..m} Σ_{j=1..m, j≠i} sdr_Proj_ij(p_j,k)    (3)

The sdr is used to calculate the density ratio of a point in the two dimensional-projections; the detailed calculation is introduced in (4) and (5). SI (Statistic Information) is the point's final score, by which all the points are evaluated. An outlier's SI value is obviously different from a normal point's. For a different dataset, adjusting the weight values may bring better results.

1) Section Density Ratio Calculated in the First Projected Dimension

Outliers, when they can be detected in the projected dimensions, always appear sparser than most normal points; the section density of an outlier is thus lower than the average section density of that dimension. In our proposal, the sdr (Section Density Ratio) is used for this calculation. The sdr of a point not only reflects its sparsity compared with the others in that dimension, but also keeps this value independent between different dimensions.

DEFINITION 4 (Section Density Ratio) Let point p_i,j ∈ Section_i,γ in dimension i, where j is the point ID and γ is the section ID.
d_i,γ is the section density of point p_i,j in dimension i, and d̄_i is the average section density in dimension i. The sdr of p_i,j is denoted by the sdr of Section_i,γ, defined as follows:

sdr_Proj_i(p_i,j) = sdr_Proj_i(Section_i,γ) = d_i,γ / d̄_i    (4)

One point to be noticed is that one sdr(Section) does not correspond to only one point; it is shared by all the points in the same section. Hence the section's sdr_Proj_i(Section_i,γ) is assigned to each point's sdr_Proj_i(p_i,k). In total, m sdr_Proj_i values are obtained over all dimensions for each point.

Lemma 1. Given a dataset DB and a point p of DB in a section of dimension i, with Card(Section_i) = scn, d_i(p) = Count(Section_i(p)) and d̄_i = (1/scn) Σ_{k=1..scn} Count(Section_i,k): if p is an outlier, then d_i(p)/d̄_i < 1. Here Card(Section_i) is the number of sections in dimension i, Section_i(p) refers to the section that point p falls in, and Count(Section_i(p)) is the number of points in Section_i(p).

Proof. First, assume the outlier p is not inside a normal cluster in the projected dimension (otherwise Definition (4) applies directly). For every normal point q, Count(Section_i(p)) ≤ Count(Section_i(q)), so d_i(p) = Count(Section_i(p)) ≤ (1/n) Σ_{j=1..n} Count(Section_i(q_j)). Since the section density d of the outlier p is below that of most points by the outlier definition, this value is in turn bounded by the average over the sections, (1/scn) Σ_{k=1..scn} Count(Section_i,k) = d̄_i, and the bound is strict. Hence d_i(p)/d̄_i < 1.

2) k-Nearest Sections Calculated in the Second Projected Dimension

If outliers do not appear clearly in the low dimensions, they cannot be detected by the first step, since they are hidden among the normal points with similar distance or density. Nevertheless, these points can still be detected in the second projected dimensions. This step aims to separate such outliers from the normal points by projecting them into different dimensions; the section distance measures the sparsity of the points in the second projected dimensions. Based on the section distance concept, and referring to the k-Nearest Neighbor concept [10], we can get the sdr of the nearest sections of a point in the projected dimensions.
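Before moving on to the second projection, the first-projection sdr of (4) admits a compact sketch (Python rather than the paper's pseudo-R; the function name is chosen here, the non-empty-section average follows the remark in the algorithm section, and the PointInfo/SectionInfo layout is the one from Section III-B):

```python
def first_projection_sdr(point_info, section_info):
    """sdr_Proj_i for every point in every dimension, per (4):
    the density of the point's section divided by the average density
    of the non-empty sections of that dimension."""
    m, n = len(point_info), len(point_info[0])
    sdr = [[0.0] * n for _ in range(m)]
    for i in range(m):
        nonempty = [c for c in section_info[i] if c > 0]
        d_avg = sum(nonempty) / len(nonempty)      # average section density
        for j in range(n):
            d_point = section_info[i][point_info[i][j]]
            sdr[i][j] = d_point / d_avg            # Lemma 1: < 1 for outliers
    return sdr
```

A point sitting alone in a sparse section scores well below 1, while points of dense sections score above 1.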
DEFINITION 5 (Nearest Sections in a Projected Dimension) In the second dimension projection, the dimension is projected from i to j. Let p_j, p_f, q ∈ Section_i,γ. The nearest section neighbors N_kn(p) of the point p are defined as
N_kn(p) = { q ∈ Section_i,γ | dists(p, q) ≤ k-dists(p) },
where k-dists(p) is the dists value from p to its k-th nearest point, q is one of the k-nearest neighbor points, and Count(Section_i,γ) = s.
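A sketch of (1) and of the neighbor set N_kn in Python (the helper names and the handling of ties at the k-th distance are choices made here, not the paper's):

```python
def dists(sec_p, sec_q):
    """Section distance (1): |SecId(p_j) - SecId(q_j)| + 1.
    The +1 keeps the distance of two points sharing a section non-zero."""
    return abs(sec_p - sec_q) + 1

def k_nearest_sections(point_info, members, p, dim_j, k):
    """N_kn(p): the members of p's first-projection section whose section
    IDs in the second projected dimension j lie within p's k-th nearest
    section distance (Definition 5).

    members -- IDs of the points sharing p's section in the first
               projected dimension (p itself excluded)
    """
    d = sorted(dists(point_info[dim_j][p], point_info[dim_j][q]) for q in members)
    k_dist = d[min(k, len(d)) - 1]        # k-dists(p); ties are all kept
    return [q for q in members
            if dists(point_info[dim_j][p], point_info[dim_j][q]) <= k_dist]
```

An inner outlier shares its section with a cluster in dimension i, but after projection to dimension j its dists to these neighbors grow, which is exactly what (5) compares against the section average.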

|N_kn| is the number of p's neighbors. Then the sdr_Proj_ij of point p_j,k is defined as follows:

sdr_Proj_ij(p_j,k) = sdr_Proj_ij(Section_j,γ) = [ (1/|N_kn|) Σ_{q∈N_kn} dists(p_j,k, q) ] / [ (1/s) Σ_{f=1..s} (1/|N_kn|) Σ_{q∈N_kn} dists(p_j,f, q) ]    (5)

We calculate p_k's dists with its k-nearest neighbor points, and then take the ratio of this value against the average value of the points in the same section. While a point is projected to another dimension, a single sdr_Proj_ij value is calculated for each projection. In total, m(m−1) sdr_Proj_ij values are obtained over all the projected dimensions for each point.

Lemma 2. Given a dataset DB and points o, p, q ∈ Section_i in dimension i, let C be a normal cluster with normal points p, q ∈ C. After the second projection, the points p, q, o are projected to dimension j; p is o's k-th nearest neighbor, and q is p's k-th nearest neighbor. If o is an outlier, then dists(o, p) ≥ dists(p, q).

Proof. Normal points belong to a cluster in all dimensions, so p, q ∈ C in dimension j, while o is an outlier, so o ∉ C. If q is among o's k neighbors: when p and q lie on the same side of o, dists(o, p) ≥ dists(p, q); when p and q lie on both sides of o, then p, q ∈ C would force o ∈ C, a contradiction. If q is not among o's k neighbors, then q lies on the other side of p; if dists(p, q) > dists(o, p), then again o ∈ C, a contradiction. Therefore dists(o, p) ≥ dists(p, q).

3) Statistical Information Values for Each Point

Through the above two steps, each point gets m sdr_Proj_i values in the first projection and m(m−1) sdr_Proj_ij values in the second projection. Suitable weights for the SI in (3) are needed to give a sharp boundary for comparing points. After evaluating different weight values and their performance, we choose simple and clear ones: we take the reciprocal of the sdr_Proj_i values and average both parts, setting ω1 = 1/m and ω2 = 1/(m(m−1)). The outliers then have obviously larger SI than the normal points.
DEFINITION 6 (Statistical Information of a point)

    SI_ns(p_k) = (1/m) · Σ_{i=1}^{m} sdrProj(p_{i,k}) + (1/(m(m-1))) · Σ_{i=1}^{m} Σ_{j≠i} sdrProj(p_{i,j,k})        (6)

Equation (6) sums up the sdr values over all projected dimensions. In low dimensional space, the SI value of a normal point should be close to 2, and an outlier's SI value should be obviously larger than 2. However, this does not hold in high dimensional space, where the SI of normal points gets close to that of outliers. Nevertheless, the outlier's SI is still obviously higher than the normal points', so outliers can be detected simply by finding the points with the top largest SI values.

E. Algorithm

Now we focus on how to implement the k-NS method in the R language. How to obtain PointInfo and SectionInfo efficiently across the different sections and dimensions is a key issue that needs to be considered in detail. The proposed algorithm is shown in Table 2 in pseudo-R code. Here, the dataset has n points in m-dimensional space, and the range of the data is divided into scn sections in each dimension.

Table 2. k-NS Algorithm

    Algorithm: k-Nearest Section
    Input: k, data[n, m], scn
    Begin
      Initialize(PointInfo[n, m], SectionInfo[scn, m])
      For i = 1 to m                         # first projection
        d_i = n / length(SectionInfo[SectionInfo[, i] != 0, i])
        For j = 1 to n
          Get sdrProj(Section_i, γ) with the section density ratio in (4)
          sdrProj(Section_i, γ) denotes sdrProj(p_{i,j})   (PointInfo[i, j] = γ)
        End For j
      End For i
      For c = 1 to 10                        # second projection: resort dimensions in random order
        For i = 1 to m
          For j = 1 to scn
            PtNum <- SectionInfo[j, i]
            If (PtNum == 0) next
            PtId <- which(PointInfo[, i] == j)
            If (PtNum < 3k/2) sdrProj_j = 1
            Else
              For each (p in PtId) {
                If (i < m) i' = i + 1 Else i' = 1
                Get dists(p, PtId, i') with Definition (1)
                Get sdrProj_j with Definition (5)
              }
          End For j
        End For i
      End For c
      Get the SI value with Definition (6) for each point
    Output: outliers by point ID (SI(p) >> mean SI, or top SI score)

Three points need to be clarified in this algorithm. The first is how to decide the average section density d_i in each dimension. The value of d_i is obtained from the definition of the average section density, n/scn, which would make d the same in every dimension. However, we must consider the special case in which most points fall into a few sections and no point falls into the others.
In this case, d becomes very low, even close to an outlier's section density. Therefore, we count only the sections that contain points, so that d_i varies across dimensions. The ratio of the section density against d_i in Definition 4 can then measure the sparsity of points in the different sections of a dimension. The second point is the number of points in one section, for which there are three different cases.

Case 1: no point in the section. The algorithm simply skips this section and goes to the next one.

Case 2: many points in the section. The nearest sections method is used directly to detect points.

Case 3: only a few points in the section. The point distribution is difficult to judge from just a handful of points. In addition, the section density ratio of the first step must already be very low, so these points have been detected by the previous step; we skip this section too.

The threshold separating case 2 from case 3 is related to k. k should not be large, because k must be less than d in step 2. Through experiments with values from 4 to 20 to find a suitable value for k and for the threshold on the number of points in one section, we found that a threshold of 3k/2 is the best choice and can be used in most situations.

F. Complexity Analysis

The three-step procedure is considered separately to state the complexity of the k-NS algorithm. The first step calculates the section density in each projected dimension; its time complexity is O(m·n). The second step calculates the k-nearest-section density between projected dimensions; its time complexity is O(scn·(m-1)·m), and since all points in a section are visited, this becomes O(n·(m-1)·m). The last step sums up the weighted values for each point in O(n) time. Hence, the total time complexity is

    T(n) = O_1 + O_2 + O_3 = O(m·n) + O(n·(m-1)·m) + O(n) ≈ O(n·m²)

k-NS spends most of its processing time in the loop over dimension projections and in locating each point's section in each dimension. The space complexity of k-NS is

    S(n) = O(2·scn·m + 3·m·n + m + n) ≈ O(3·m·n) = O(m·n)

We record the necessary information and intermediate results for each point and each section; the temporary space needed during the procedure is small.

G. Distinction between Outliers and Noisy Points

The concepts of outliers and noisy points were proposed more than ten years ago. According to them, an outlier is regarded as abnormal data generated by a different mechanism and containing valuable information, while noisy data are regarded as a side product of clustered points, carrying no useful information but greatly affecting the correctness of the result. In the data space, outliers are points that are farther from the others by some measure, while noisy points always appear around the outliers. Since noisy points are also far away from the normal points, in low dimensional space it is difficult to draw a distinct boundary between outliers and noisy points. Based on this frustrating observation, some researchers even consider noisy points to be a kind of outlier, with no difference when detecting abnormal data by any method. Hence, it is a meaningful issue to differentiate outliers from noisy points, not only in concept but also in detection measures.

[Figure 3. Noisy Data of Dataset 5 Projected to Two-Dimensional Space (a), and Dataset 3 Projected to Two-Dimensional Space (b)]

In this paper, we explain the distinction between outliers and noisy points in two respects. The first is the data generation process: outliers are generated by a distribution different from that of the normal points, while noisy points follow the same distribution as the normal points. The second is the abnormal state across dimensional space: outliers appear abnormal in most of the dimensions, while noisy points appear abnormal only in a few dimensions and normal in the others. Viewed over all dimensions, noisy data still conform to the same distribution as the normal data. Outliers may look the same in low dimensional space, but they follow a different distribution mechanism from the normal points. Therefore, the difference between outliers and noisy points shows up in some projected dimensional spaces. An example of noisy data is shown in Fig. 3(a).
The data is retrieved from Dataset 8 as introduced in Section IV, which contains 1000 points in 10000 dimensions. The outliers are placed in the middle region and can hardly be distinguished from the normal points, whereas the noisy points, labeled with a cloud symbol, look quite different in this projected two-dimensional space. Another example is shown in Fig. 3(b): the outliers are not always obvious in a low projected dimensional space, while the noisy points distributed on the marginal area of both dimensions look like abnormal points.

IV. EVALUATION

We have implemented our algorithm, applied it to several high dimensional datasets, and then compared k-NS with LOF and LOCI. In order to compare these algorithms under fair conditions, we ran all of them in the R language on a MacBook Pro with a 2.53 GHz Intel Core 2 CPU and 4 GB of memory.

A. Synthetic Datasets

A critical issue in evaluating outlier detection algorithms is that no benchmark datasets are available in the real world that provide an explicit division between outliers and normal points. For the points that are found as outliers in some real dataset, it is impossible to provide a reasonable explanation of why these points are picked out as outliers. On the other hand, what we have learned from statistics is helpful for generating artificial datasets: if some points follow distributions that are apparently different from those of the normal points, these points can be regarded as outliers. Hence, we generate the synthetic data based on this assumption.

[Figure 4. Effectiveness Comparison between LOF and k-NS on Eight Datasets from Dimension 10 to 10000. Precision-Recall in (a) and (b); F-measure in (c)]

[Table 3. Experiment Datasets]

We generate eight synthetic datasets with 500-1000 points and 10-10000 dimensions. The normal points follow normal distributions, while the outliers follow random distributions in a fixed region. The normal points are distributed in five clusters with random μ and σ, and 10 outliers are distributed randomly in the middle of the normal points' range. More details about the parameters of each dataset are shown in Table 3. The experimental datasets are generated under the rule that the outliers' range must lie within the range of the normal points in every dimension; therefore, the outliers cannot be found in low dimensional space. A data distribution example is shown in Fig. 3(b), where Dataset 3 is projected to two-dimensional space with the outliers labeled in red. It clearly shows that the outliers lie within the range of the normal points and appear no different from them in this two-dimensional space. Noisy points placed on the margin of the distributed area are more likely to be regarded as abnormal points. Hence, outliers and normal data cannot be separated just by direct observation of the different distributions.

B. Effectiveness

First, we conduct a two-dimensional experiment using the dataset in Fig. 2. The results show that all three algorithms perform well, so our proposed algorithm can also run on low dimensional datasets. Next, our proposed algorithm is evaluated thoroughly by a series of experiments and compared with LOF. LOCI is excluded from this comparison because it performs poorly on every dataset. In order to measure the performance of the algorithms with precision and recall, the 10 outliers are retrieved one by one. In the evaluation of all eight dataset experiments, we obtain 10 precision values and 10 recall values for every dataset, and from them 10 F-measures. We pick the highest F-measure from each dataset to demonstrate the performance of LOF and k-NS.

At the beginning, we need to set the appropriate parameters for the eight experimental datasets. The parameters are the best ones for the prepared datasets, and they change with the data size and the number of dimensions. The parameter Knn of LOF is set to around 10 in all experiments, since the dataset size is only 500 or 1000 points; this is a reasonable ratio of neighbor points to the whole dataset size. For our algorithm, the parameters d and scn are the inverse of each other: the product of d and scn is equal to n. We set scn a little larger than d, because these parameter combinations have shown better experimental results.

The 10-dimensional experimental result is shown in Fig. 4(a). LOF performs best in this 10-dimensional experiment. In particular, LOF can detect two outliers with very high precision. Nevertheless, the precision of LOF falls sharply as the recall increases from 20% to 40%, and in the end its precision in detecting all outliers correctly is worse than that of k-NS. As a whole, though, the performance of k-NS is below LOF's. The reason both algorithms perform poorly is that the outliers are placed in the center of the normal data in our datasets, which prevents them from being found in low dimensional space. Therefore, it is difficult to find the exact outliers in 10-dimensional space.
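The evaluation protocol just described (rank points by score, retrieve them one by one, and record precision, recall and F-measure each time one of the planted outliers is found) can be sketched as follows; the function name and inputs are illustrative, not from the paper's code.

```python
def precision_recall_curve(ranked_ids, true_outliers):
    """Walk a ranking from the top; whenever a true outlier is hit,
    record (precision, recall, F-measure) at that cutoff. With 10
    planted outliers this yields the 10 P/R pairs and 10 F-measures
    used per dataset in the experiments."""
    true_outliers = set(true_outliers)
    curve = []
    hits = 0
    for n, pid in enumerate(ranked_ids, start=1):
        if pid in true_outliers:
            hits += 1
            precision = hits / n
            recall = hits / len(true_outliers)
            f_measure = 2 * precision * recall / (precision + recall)
            curve.append((precision, recall, f_measure))
    return curve
```

The highest F-measure on such a curve is the per-dataset score reported in Fig. 4(c).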
When the number of dimensions increases to 100, the precision and recall on the 2nd dataset clearly show the effectiveness of these algorithms. Different from the first dataset, k-NS achieves 100% precision at every recall level. LOF obviously reduces its precision from 100% to 43.48% as the recall increases from 70% to 100%, as shown in Fig. 4(b). In fact, k-NS keeps this perfect result in 100 dimensions, while LOF performs much more poorly in terms of both precision and recall.

The experiments on datasets 1 to 8 are shown in Fig. 4(c). LOF needs the largest F-measure to be picked for each dataset, while k-NS only needs it for the first dataset; in addition, the F-measures of k-NS are always 1 on datasets 2 to 8. The experiments show that k-NS performs perfectly at finding inner outliers in high dimensional space, while LOF suffers greatly from the curse of dimensionality. We also find that the precision becomes better when the dataset size is increased, but this does not hold for LOF.

[Figure 5. Running Time]

C. Efficiency

We also compare the algorithms' running time. In the R language, the reported running time includes user time, system time and total time; we use only the user time for the comparison. As shown in Fig. 5, LOF is faster in all experiments. Both algorithms take more time when the number of dimensions or the data size increases. The reason is that LOF involves no dimension-loop calculation, since it only processes the distances between a point and its neighbors, whereas our proposed algorithm computes values over all first-projected dimensions and all second-projected dimensions.

D. Performance on Real World Data

In this subsection, we compare the algorithms on a real-world dataset publicly available at the UCI machine learning repository [24]. We use the Arcene dataset, provided by the ARCENE group; the task is to distinguish cancer versus normal patterns from mass-spectrometric data. This is a two-class classification problem with continuous input variables, and the dataset is one of the five datasets of the NIPS 2003 feature selection challenge. The original dataset includes 900 instances with 10000 attributes in total. The dataset is split into a training set, a validation set and a test set, each labeled with positive and negative except for the test set. For the 700 instances in the test set, we only know that 310 instances are positive and 390 are negative. The best_svm_result is available at [25]: 308 instances are labeled positive and 392 negative. We use this SVM result for evaluating LOF and our proposal. We create a dataset by adding 10 randomly selected negative instances to the 308 positive instances retrieved by the SVM; the first evaluation uses this dataset of 318 instances in total. The second evaluation uses the 392 retrieved negative instances, to which we apply the two algorithms to detect outliers.

[Figure 6. Top 20 Points Detected in Arcene Data: SI scores and point IDs of the top 20 points for LOF and k-NS, with the recall of LOF, k-NS and the mixed result]

The result of the first experiment is shown in Fig. 6. The top 20 points are chosen for both algorithms; SI is the score of a point, and Pt ID is the point ID. Points with a Pt ID larger than 308 are true outliers. In the top ten points, three outliers are detected by LOF and two by k-NS; in total, five outliers are detected by the mixed result combining both algorithms. In the top twenty points, seven outliers are detected by LOF and three by k-NS; in total, nine outliers are detected by the mixed result. In both results, LOF is better than k-NS.
However, the k-ns can help to ncrease the detecton accuracy from 30% to 50% n 0 ponts, 70% to 90% n 20 ponts. In another word, the k- NS supply a reasonable alternatve soluton to ncrease the precson results. As a contrast, we also gve the LOCI result, whch output pont ID (8, 20, 48, 95, 53, 89, 93, 242, 307, 3, 35, 37). Its recall s 30%, the same as k-ns. However, all the outlers detected n LOCI are also detected by. Table 4. Recall % Top 5 Ponts detected n Arcane Data

In the second experiment, there are two positive points misclassified by SVM; finding these two points is the task of this experiment. As seen in Table 4, point IDs 29 and 82 are the most probable outliers according to the intersection of the LOF and k-NS results. It is notable that both points appear in the top three detected points of both results. If we consider the LOCI result instead, the intersection gives point IDs 53 and 75, which is entirely different from k-NS. Nevertheless, in contrast with the former results, the first conclusion seems more reasonable.

V. CONCLUSION

In this paper, we introduce a new definition of inner outliers, and then present a novel method, called k-NS, designed to detect such inner outliers with the top largest scores in a high dimensional dataset. The algorithm is based on a statistical method with three steps. (i) Calculate the section density ratio of each point in each dimension after the first projection. (ii) Compute the nearest-section density ratio of each point in all projected dimensions after the second projection. (iii) Sum all sdr values of each point into a weighted value (SI), and compare it with those of the other points. Each point gets in total m + m×(m-1) values to be compared. Experimental results on synthetic datasets with dimensions from 10 to 10000 have shown that our proposed k-NS algorithm has the following advantages: it is immune to the curse of dimensionality, it adapts to various outlier distributions, and it shows outstanding performance in detecting inner outliers in high dimensional data space.

The difference between outliers and noisy data is also discussed in this paper. This distinction is difficult to make in low dimensional space. In our experiments, the noisy data and the outliers are distinguished by comparing their distributions in the projected dimensions and over the whole dimensional space, and in our cases the noisy data even appear more abnormal than the outliers in some projected dimensional spaces. As ongoing and future work, we will continue to improve the algorithm by finding the best relationship between the two-step sdr values.
Besides datasets with high dimensions, outlier detection over large-scale datasets, or with incremental updates instead of recomputation over the entire dataset, needs to be studied. Another issue is the expensive processing time in high dimensional space; solutions for reducing it need to be investigated, and one of the approaches may be the use of parallel processing.

REFERENCES

[1] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander. LOF: identifying density-based local outliers. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.
[2] Charu C. Aggarwal, Philip S. Yu. Outlier detection for high dimensional data. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data.
[3] Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons. LOCI: fast outlier detection using the local correlation integral. IEEE 19th International Conference on Data Engineering, 2003.
[4] Hans-Peter Kriegel, Matthias Schubert, Arthur Zimek. Angle-based outlier detection in high-dimensional data. The 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.
[5] Christian Böhm, Christos Faloutsos, et al. Robust information-theoretic clustering. The 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
[6] Zhana, Wataru Kameyama. A Proposal for Outlier Detection in High Dimensional Space. The 73rd National Convention of the Information Processing Society of Japan, 2011.
[7] D. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.
[8] Fabrizio Angiulli, Clara Pizzuti. Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering (TKDE), 17(2):203-215, February 2005.
[9] Christian Böhm, Katrin Haegler. CoCo: coding cost for parameter-free outlier detection. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[10] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim. What is the nearest neighbor in high dimensional spaces?
Proceedings of the 26th VLDB Conference, 2000.
[11] Dantong Yu, et al. FindOut: finding outliers in very large datasets. Knowledge and Information Systems (2002) 4:387-412.
[12] Christian Böhm, Christos Faloutsos, et al. Outlier-robust clustering using independent components. Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data.
[13] T. de Vries, S. Chawla, M. E. Houle. Finding local anomalies in very high dimensional space. 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 128-137, 13-17 Dec. 2010.
[14] Anny Lai-mei Chiu, Ada Wai-chee Fu. Enhancements on local outlier detection. Proceedings of the Seventh International Database Engineering and Applications Symposium (IDEAS '03).
[15] Aaron Ceglar, John F. Roddick, David M. W. Powers. CURIO: a fast outlier and outlier cluster detection algorithm for large datasets. AIDM '07: Proceedings of the 2nd International Workshop on Integrating Artificial Intelligence and Data Mining, Australia, 2007.
[16] Feng Chen, Chang-Tien Lu, Arnold P. Boedihardjo. GLS-SOD: a generalized local statistical approach for spatial outlier detection. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
[17] Michal Valko, Branislav Kveton, et al. Conditional anomaly detection with soft harmonic functions. Proceedings of the IEEE 11th International Conference on Data Mining (ICDM '11), 2011.
[18] Ashok K. Nag, Amit Mitra, et al. Multiple outlier detection in multivariate data using self-organizing maps. Computational Statistics, 2005, 20:245-264.
[19] Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, Vol. 78, No. 9, September 1990.
[20] Naoki Abe, Bianca Zadrozny, John Langford. Outlier detection by active learning. Proceedings of the 12th ACM SIGKDD International Conference, 2006.
[21] Ji Zhang, et al. Detecting projected outliers in high-dimensional data streams. Proceedings of the 20th International Conference on Database and Expert Systems Applications (DEXA '09).
[22] Alexander Hinneburg, Daniel A. Keim.
Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. The 25th VLDB Conference, 1999.
[23] Amol Ghoting, et al. Fast mining of distance-based outliers in high-dimensional datasets. Data Mining and Knowledge Discovery, Vol. 16:349-364, 2008.
[24] http://archive.ics.uci.edu/ml/datasets/Arcene (visited on May 6th, 2012).
[25] http://clopinet.com/isabelle/projects/nips2003/analysis.html#svmresu (visited on May 6th, 2012).