Outlier Detection Methodologies Overview

Mohd. Noor Md. Sap
Department of Computer and Information Systems, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johor Bahru, Malaysia
mohdnoor@fksm.utm.my

Ehsan Mohebi
Department of Computer and Information Systems, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johor Bahru, Malaysia
saeh_hamo@yahoo.com

Abstract

The outlier detection problem is an important issue in many safety-critical environments. Outliers arise due to mechanical faults, changes in system behavior, fraudulent behavior, human error, instrument error, or simply through natural deviations in populations. The most popular outlier detection methods suggested so far are density-based and distribution-based methods that employ a metric equation to identify outliers. Other methods apply neural network methodologies to keep track of outliers. In this paper we compare recent, well-known outlier detection techniques and consider the strengths and weaknesses of each approach separately.

Keywords: Outliers, Statistics, Spatial data, K-NN.

1. Introduction

Outliers can be defined as in [1]: "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism." Statistical approaches were the earliest algorithms used for outlier detection; they are suited to quantitative real-valued data sets or, at the very least, quantitative ordinal data distributions. One of the earliest outlier detection methods was suggested by [2]. It calculates a Z value as the difference between the mean value of the attribute and the value being tested, divided by the standard deviation, where the mean and standard deviation are calculated from all attribute values. A common criterion used for outlier detection is the K-Nearest Neighbor (K-NN) algorithm. To find outliers in this case, the neighbors of every point must be calculated, with a complexity of $O(mn^2)$, where $m$ is the dimension and $n$ is the number of points.
This method is expensive for large and high-dimensional data sets; [3], [4] and [5] have proposed new methods to overcome this issue. All K-NN methods use a distance metric such as the Euclidean or Mahalanobis distance to measure the distance between points. The latter is more expensive because it requires the covariance matrix computed over all the related point records.

One of the most popular and widely studied clustering methods for objects in Euclidean space is the k-means clustering algorithm [6]. The k-means method requires the user to specify the number of clusters k, and it provides a local model of the data. The algorithm represents each of the k clusters by a prototype vector whose attribute values are the mean values across all points in the cluster. The cluster centers are updated to reflect each newly assigned instance.

In section 2 of this paper we discuss distance-based outlier detection methods. Density-based methods are presented in section 3. In section 4 we present two types of spatial outlier detection methodology. In the last part of the paper we present a discussion and conclusion with regard to time complexity.

2. Distance Based

Knorr and Ng (1998) presented an efficient K-NN algorithm. It is efficient because it does not calculate all k neighbors of a point; only $m < k$ neighbors need to be determined. As a result it is less sensitive to computational growth, so for large data sets this method gives an acceptable time complexity. The outlier is defined as follows: if there are fewer than $m$ neighbors inside the distance threshold $d$, then the instance is an outlier. The shortcoming is that the user must define the parameters $m$ and $d$ in advance. A poor choice of these parameters may label normal points as false outliers and vice versa.

Ramaswamy (2000) introduced an optimized K-NN outlier detection method, which produces a ranked list of potential outliers. The optimization combines K-NN with partitioning the data into cells. Ramaswamy used nested-loop, index-based and partition-based algorithms to find outliers.
In this method an outlier is defined as follows: a point $p$ is an outlier if no more than $n-1$ other points in the data set have a higher $D^m$ (the distance to the $m$-th nearest neighbor), where $m$ is user defined. However, its complexity does not scale well, because all K-NN distances must be calculated. The time complexity of these three algorithms has been compared (see Figure 1), where $N$ is the number of instances. A drawback of the method proposed by Ramaswamy is that the user has to know in advance how many outliers there are in the data set [7]. In some cases there is only one point whose distance to its neighbors is so large that it clearly lies in a sparse region and is obviously detected as an outlier.
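The two distance-based definitions in this section can be illustrated with a minimal brute-force sketch. It is not the optimized algorithm of either paper: the function names, the toy data and the parameter values are assumptions made for illustration, and no indexing or partitioning is used.

```python
import math


def neighbor_count_outliers(points, d, m):
    """Flag points that have fewer than m other points within distance d."""
    flagged = []
    for i, p in enumerate(points):
        count = 0
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) <= d:
                count += 1
                if count >= m:  # enough neighbors found, stop searching early
                    break
        if count < m:
            flagged.append(i)
    return flagged


def top_n_mth_distance_outliers(points, m, n):
    """Return the n points with the largest distance to their m-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[m - 1], i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]


# Toy data (hypothetical): a tight cluster of four points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(neighbor_count_outliers(points, d=0.5, m=2))      # [4]
print(top_n_mth_distance_outliers(points, m=2, n=1))    # [4]
```

The first function needs both a distance threshold and a neighbor count, while the second needs the number of outliers in advance, which mirrors the parameter-selection drawbacks noted for each method.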

Figure 1. Performance results for N [5]

3. Density Based

To achieve better results in finding interesting outliers and to overcome some shortcomings of the distance-based methods (which capture only global outliers, etc.), density-based methods have been proposed. M. Breunig et al. (2000) introduced the concept of local density outliers and a measure, LOF (Local Outlier Factor), which captures the degree of outlier-ness of every object in the data set in order to pick up local outliers.

Aggarwal and Yu (2001) use lower-dimensional projections of the data set and focus on key attributes. They used an evolutionary search algorithm and a brute-force algorithm, the latter examining all $k$-dimensional projections, and retained the projections with the most negative sparsity coefficient; a search over these projections then finds the outliers. In this method all points within the same cell are regarded together as either normal objects or outliers. The method therefore has the drawback that normal objects may sometimes be detected as outliers, and vice versa.

G. Kollios et al. (2003) proposed a density-based biased sampling method to detect $DB(p, k)$ outliers. A kernel density estimator is built from randomly sampled points to approximately represent the density of the data set. The estimator can be used to estimate the probability that each data point belongs to the data set. For each object $O$, the function $N(O, k)$ is defined as the number of objects in the data set whose distance from $O$ is at most $k$. Outliers are defined as follows: an object $O$ is a $DB(p, k)$-outlier only if $N(O, k) \le p$. The proposed algorithm takes one pass over the data set to compute the density estimator function, so this step is linear in the size of the data set. Since each object in the data set has to be read once in order to compute the value of $N(O, k)$, one more full data set scan is needed. The cost of this step grows with both the size of the data set and the number of samples used to construct the density estimator. One drawback of this method is that a large number of samples improves the accuracy but increases the running time.
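The idea of estimating density from a random sample can be sketched as follows. This is a simplified illustration, not the algorithm of Kollios et al.: it builds a Gaussian kernel density estimate from a uniform random sample and simply ranks points by estimated density. The function names, the bandwidth, the sample size and the toy data are all assumptions.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Build a density function from sample points using a product Gaussian kernel."""
    dim = len(sample[0])
    norm = len(sample) * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim

    def density(x):
        total = 0.0
        for s in sample:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return total / norm

    return density


def lowest_density_points(points, sample_size, bandwidth, n_outliers, seed=0):
    """Rank all points by the sampled density estimate and return the sparsest ones."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)       # random sample of the data set
    density = gaussian_kde(sample, bandwidth)      # estimator built from the sample only
    ranked = sorted(range(len(points)), key=lambda i: density(points[i]))
    return ranked[:n_outliers]


# Toy data (hypothetical): a 7 x 7 grid of clustered points plus one isolated point.
points = [(0.1 * i, 0.1 * j) for i in range(7) for j in range(7)] + [(10.0, 10.0)]
print(lowest_density_points(points, sample_size=20, bandwidth=0.5, n_outliers=1))  # [49]
```

Evaluating the estimator costs one kernel evaluation per sample point for every object, which is the accuracy versus running time trade-off described above.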
In fact, how well a kernel density estimator works in high-dimensional space has not been fully explored, but it seems to be less accurate there. We next discuss a different density estimation strategy, together with a method that overcomes some shortcomings of Brito's method [7].

Brito et al. (1997) proposed a Mutual k-Nearest Neighbor (MkNN) graph based approach. The MkNN graph is a graph in which an edge exists between vectors $v_i$ and $v_j$ if each belongs to the other's k-neighborhood. The MkNN graph is undirected and is a special case of the k-Nearest Neighbor (kNN) graph, in which every node has pointers to its k nearest neighbors. Each connected component is considered a cluster if it contains more than one vector, and an outlier if it contains only one vector. A potential problem with Brito's definition is that an outlier that is too close to an inlier could be misclassified [7].

To improve on Brito's method, Hautamaki et al. (2004) proposed Outlier Detection using In-degree Number (ODIN), an algorithm that utilizes the k-nearest neighbor graph. In this method the outlier is defined as follows: given the kNN graph $G$ for data set $S$, an outlier is a vertex whose in-degree is less than or equal to a threshold $T$. The comparison also includes a different variant of Ramaswamy's definition, measured from the maximum kNN distances, with a threshold parameter between 0 and 1. Experimental results show that ODIN performs well and produces a lower error rate on synthetic data sets in comparison with the methods of Brito and Ramaswamy.

Bay and Schwabacher (2003) proposed an approach that can detect outliers in running time that is near linear in the data set size. This method is an optimized version of the nested-loop algorithm that makes use of randomization and a simple pruning rule. The data set is randomized and divided into small blocks, and the blocks are handled one by one. For the first block, each object is compared with every object in the whole data set in order to compute its score (which is the distance to its k-th nearest neighbor).
According to these scores, the top outliers in the first block can be determined, and the score of the weakest of these top outliers is used as a cut-off for the second block. As more blocks are processed, more extreme outliers are found and a larger cut-off can be used for the next block. As a result, pruning becomes more efficient after each iteration. The randomization of the whole data set is essential for this method: performance can be very poor if the data set is sorted, or if objects that are clustered together in space also appear together in the data set file. The main shortcoming is that this method needs to scan the whole data set once per block, that is, as many times as there are blocks. When the whole data set cannot fit in main memory, expensive disk scans can result in very poor performance. Even though the worst-case complexity is still $O(N^2)$, experimental results show that this method can achieve near-linear running time.

One shortcoming of the method proposed by Knorr et al. is that it cannot achieve good performance with very large or high-dimensional data sets. To overcome this disadvantage, D. Ren et al. (2004) improved Knorr's method by processing a vertical data structure instead of the traditional horizontal structure. The neighborhood of a data point $P$ with radius $r$ is defined as follows, where $X$ is the data set:

$$N(P, r) = \{\, Q \in X : \lVert P - Q \rVert \le r \,\}$$

The outliers are then defined as:

$$Outlier(X, p, r) = \{\, P \in X : |N(P, r)| \le (1 - p)\,|X| \,\}$$

where $N(P, r)$ is the $r$-neighborhood of $P$ defined above and $p$ is a user-specified fraction.

They proposed a vertical by-neighbor outlier detection method with local pruning (PODMP, see footnote 1), which detects outliers efficiently and scales well to large data sets. The vertical method works as follows. First, the data set to be mined is represented as a set of P-Trees. Second, one point $P$ in the data set is selected arbitrarily; its $r$-neighbors are then searched using fast computation on inequality P-Trees, and the $r$-neighbors are represented by an inequality P-Tree, which is called a neighborhood P-Tree. In the neighborhood P-Tree, 1 means that a point is a neighbor of $P$, while 0 means that it is not. Third, the number of points among the $r$-neighbors is calculated efficiently by extracting values from the root node of the neighborhood P-Tree [12]. They compared the running time of their method with that of nested loop (NL) (see Figure 2).

Figure 2. Comparison of scalability of NL, PODM, and PODMP [11]

In conclusion, the two distance-based definitions discussed above, the neighbor-count definition and the k-th nearest neighbor distance definition, can only capture global outliers, because they take a global view of the data set. For a data set with simple structure, for example one that contains one or more clusters of similar density, these two definitions work well. However, for many real-world data sets with complex structure, methods based on these two definitions might not be able to find interesting outliers.

4. Spatial Outlier Detection

Spatial outliers are spatial objects whose non-spatial attribute values are significantly different from the values of their neighborhoods. Spatial outlier detection methods in the spatial statistics literature can be grouped into two categories: graphical approaches and quantitative tests.

4.1 Graphical approach

In graph-based spatial outlier detection the main idea is based on graph connectivity [13]. For spatial outlier detection methods, the choice of statistic is important and depends on what kind of data is considered. The proposed statistic is $S(x) = f(x) - E_{y \in N(x)}(f(y))$, where $f(x)$ is the attribute function, $N(x)$ is the fixed set of neighbors of $x$, and $E_{y \in N(x)}(f(y))$ is the average attribute value of the neighbors of $x$.
In fact, $S(x)$ denotes the difference between the attribute value of each node and the average attribute value of its neighbors. A node $x$ is detected as an outlier when $|(S(x) - \mu_s)/\sigma_s| > \theta$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of all $S(x)$ and $\theta$ is a threshold. The most costly part of the algorithm is finding the set of neighbor nodes. The I/O cost of finding the neighbor node set is determined by the connectivity residue ratio (CRR), i.e. how the nodes are grouped into disk pages. If a node and all of its neighbor nodes reside in the same disk page, no redundant I/O operation is required.

[Footnote 1] PODMP: P-Tree-based outlier detection method using pruning.

4.2 Quantitative tests

Chang et al. (2003) [14] proposed two iterative algorithms that detect outliers over multiple iterations, and a non-iterative algorithm that uses the median as the neighborhood function: the iterative r algorithm, the iterative z algorithm and the median algorithm, respectively. The first and second algorithms compute the $k$ nearest neighbor set $NN_k(x_i)$ of each spatial point $x_i$ and a neighborhood function $g$, which is the average attribute value over $NN_k(x_i)$. In both algorithms, to detect spatial outliers, the attribute value of each point (given by the attribute function $f$) is compared with the attribute values of its neighbors by a comparison function $h$. A point $x_q$ is then an outlier if $h(x_q)$ is the maximum of the set $\{h(x_1), h(x_2), \dots, h(x_n)\}$ and exceeds the threshold $\theta$; that is, $x_q$ is an outlier if its comparison value is large enough relative to the threshold. Once an outlier is detected, a correction is made immediately, such as replacing the attribute value of the outlier by the average of its neighbors, to avoid normal points being labeled as outlier candidates. In the third algorithm (median), instead of the average value, $g$ is the median of the attribute values in $NN_k(x_i)$ (the middle value of the ordered values, or the mean of the two middle values when their number is even). All three proposed algorithms detect true outliers more effectively than the z algorithm [15], the Scatterplot [16] and the Moran Scatterplot algorithm [17].
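The neighborhood comparison with immediate correction can be sketched as below. This is a minimal illustration in the spirit of the iterative z-style test, not the authors' implementation: it assumes the comparison function is the difference between a point's attribute value and the mean of its k nearest neighbors, standardized over all points, and the function names, toy data and parameter values are hypothetical.

```python
import math
import statistics


def k_nearest(points, i, k):
    """Indices of the k spatially nearest neighbors of point i (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def iterative_spatial_outliers(points, values, k, theta, max_outliers=10):
    """Repeatedly flag the most extreme point, then replace its value by its neighbor average."""
    values = list(values)  # work on a copy so corrections do not alter the input
    neighbors = [k_nearest(points, i, k) for i in range(len(points))]
    found = []
    for _ in range(max_outliers):
        g = [statistics.fmean(values[j] for j in nb) for nb in neighbors]
        h = [values[i] - g[i] for i in range(len(values))]
        mu, sigma = statistics.fmean(h), statistics.pstdev(h)
        if sigma == 0:
            break
        z = [abs((hi - mu) / sigma) for hi in h]
        q = max(range(len(z)), key=z.__getitem__)
        if z[q] <= theta:
            break
        found.append(q)
        values[q] = g[q]  # correction step: neighbors no longer look abnormal
    return found


# Toy data (hypothetical): ten locations on a line, one abnormal attribute value.
points = [(float(x), 0.0) for x in range(10)]
values = [10, 11, 10, 12, 11, 50, 10, 11, 12, 10]
print(iterative_spatial_outliers(points, values, k=2, theta=2.0))  # [5]
```

After the value at index 5 is corrected, its two neighbors are no longer pulled toward the abnormal value, so they are not reported as outliers in the second pass. A median-based variant would replace the neighbor mean with `statistics.median` and test all points in a single pass.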
The method next introduced by Zhan et al. (2004) considers a set of multi-attribute, multi-dimensional spatial objects, each with several attributes, arranged in a two-dimensional matrix. It can accurately detect spatial outliers after the attribute correlations are calculated with the attribute function $F$. This method also employs a set of attribute importance values $P$ (each between 0 and 9, one per attribute), which express the degree of importance of the different attributes of the objects in the data set. Consider an object $s$ and the $k$ spatial objects in the neighborhood of $s$. In order to compute the distribution value of the neighborhood connected with $s$, an aggregate function of the attribute correlations is proposed:

$$F'_{aggr}(s) = \sum_{i=0}^{k} F'(s_i) / k$$

The estimate for the multi-attribute set is:

$$V(s) = P\,\bigl(F'(s) - F'_{aggr}(s)\bigr)$$

According to the theory of the multi-dimensional distribution of random functions, if $\mu$ and $\sigma^2$ are the sample mean and variance of the set of $V(s)$ values, outliers are detected on the standardized set $(V(s) - \mu)/\sigma$, which gives the standard value of each $V(s)$. We conclude that $V(s)$ is an extreme value in the original data set if it is extreme in the standardized data set; as before, it is compared to a threshold. To reduce the computational complexity of this algorithm, an auxiliary secondary index (the dynamic R-tree index structure) on top of the data file is used to support the query operation. The experimental tests show that the algorithm detects true outliers more effectively than the earlier algorithms, including the median algorithm.

Huang et al. (2005) [18] introduced a new density-based spatial outlier detection method with a stochastic search algorithm, named SODSS. This method eliminates many neighborhood queries. Unlike DBSCAN, it does not scan the database point by point to find the neighborhood of each spatial point. The algorithm divides the data set into three labeled segments: cluster set, candidate set and outlier set. Unlike DBSCAN and GDBSCAN, once the algorithm has labeled the neighbors of a point as part of a cluster, it does not examine the neighborhood of each of those neighbors. A neighborhood query can be computed in logarithmic time using an index data structure. With the new approach, the computational complexity decreases to a log-linear bound whose multiplicative factor is related to the threshold, or maximum number of neighbors, and is much smaller than the size of the data set.

5. Discussion and Conclusion

The earliest methods require the user to have knowledge about the distribution of the data sets [4]. All of the early methods (distance based or density based) perform poorly as the dimension increases. To obtain better results in high dimensions, researchers have proposed methods such as that of [19]. The other factor to consider is the time complexity of the existing algorithms. Some algorithms, such as nested loop (NL) [6], scan the data set at least twice, which is very expensive for large data sets when the result is needed immediately. Some methods have been presented that perform better on large data sets [10] [11]. The time complexity of the known algorithms is as follows (see Table 1).
Table 1. The time complexity of existing algorithms

Algorithm          Complexity
Nested-loop [6]    $O(N^2)$
Tree Indexed       $O(N \log N)$
Cell Based [19]    linear in $N$, exponential in the dimension
PODMP [11]         lower than Tree Indexed; its per-point factor is much smaller than $\log N$

6. Acknowledgments

I wish to thank my supervisor Dr Mohd Noor Md Sap and the reviewers for their insightful comments. This work was supported by Ministry of Science, Technology and Innovation grant vote 79224.

7. References

[1] Hawkins (1980). Identification of Outliers. Chapman and Hall, London.
[2] Grubbs, F. E. (1969). Procedures for detecting outlying observations. Technometrics, 11, 1-21.
[3] Aggarwal, C. C. & Yu, P. S. (2001). Outlier Detection for High Dimensional Data. Proceedings of the ACM SIGMOD Conference 2001.
[4] Knorr, E. M. & Ng, R. T. (1998). Algorithms for Mining Distance-Based Outliers in Large Datasets. Proceedings of the VLDB Conference, 392-403, New York, USA.
[5] Ramaswamy, S., Rastogi, R. & Shim, K. (2000). Efficient Algorithms for Mining Outliers from Large Data Sets. Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, TX, 427-438.
[6] Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, 550 pages, August 2000.
[7] Hautamaki, Ismo Karkkainen and Pasi Franti (2004). Outlier Detection Using k-Nearest Neighbor Graph. Proceedings of the 17th International Conference on Pattern Recognition (ICPR 04).
[8] Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying density-based local outliers. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, ACM, 2000, pp. 93-104.
[9] Brito, E. L. Chavez, A. J. Quiroz, and J. E. Yukich. Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1):33-42, August 1997.
[10] Bay, S. D. and Schwabacher, M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., USA, 2003, pp. 29-38.
[11] Ren, Imad Rahal, William Perrizo (2004). A Vertical Distance-based Outlier Detection Method with Local Pruning. CIKM 04, November 8-13, 2004, Washington, DC, USA. Copyright 2004 ACM.
[12] Ding, M. Khan, A. Roy, and W. Perrizo. The P-tree algebra. Proceedings of the ACM SAC, Symposium on Applied Computing, 2002.
[13] Shekhar, Ch.T Lu, and P. Zhang (2002). Detecting Graph-based Spatial Outliers. Intelligent Data Analysis: An International Journal, 6(5):451-468.
[14] Chang-Lu, D. Cheng, and Y. Kou (2003). Algorithms for Spatial Outlier Detection. Proceedings of the Third IEEE International Conference on Data Mining (ICDM 03), pp. 597-600.
[15] Shekhar, C.-T. Lu, and P. Zhang. Detecting Graph-Based Spatial Outlier: Algorithms and Applications (A Summary of Results). In Proc. of the Seventh ACM-SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Aug 2001.
[16] A. Luc. Exploratory Spatial Data Analysis and Geographic Information Systems. In M. Painho, editor, New Tools for Spatial Analysis, pages 45-54, 1994.
[17] A. Luc. Local Indicators of Spatial Association: LISA. Geographical Analysis, 27(2):93-115, 1995.
[18] Huang, X. Qin, C. Chen, and Q. Wang (2005). Density Based Spatial Outlier Detecting. Springer-Verlag Berlin Heidelberg, ICCS 2005, LNCS 3514, pp. 979-986.
[19] Aggarwal, C. C. and Yu, P. S. An effective and efficient algorithm for high-dimensional outlier detection. VLDB J., Vol. 14, No. 2, 2005, pp. 211-2