A Comparative Study for Outlier Detection Techniques in Data Mining

Zuriana Abu Bakar, Rosmayati Mohemad, Akbar Ahmad
Department of Computer Science, Faculty of Science and Technology
University College of Science and Technology
230 Kuala Terengganu, Malaysia
{zuriana, rosmayati}@kustem.edu.my

Mustafa Mat Deris, Member IEEE
Faculty of Information Technology and Multimedia
College University Technology Tun Hussein Onn
86400 Parit Raja, Batu Pahat, Johor, Malaysia
mmustafa@kuittho.edu.my

Abstract

Existing studies in data mining mostly focus on finding patterns in large datasets and using them for organizational decision making. However, finding exceptions and outliers has not yet received as much attention in the data mining field as other topics such as association rules, classification and clustering. This paper therefore describes the performance of the control chart, linear regression and Manhattan distance techniques for outlier detection in data mining. Experimental studies show that the control chart technique is better than the technique modeled on linear regression because the number of outlier data detected by the control chart is smaller than the number detected by linear regression. Further, experimental studies show that the Manhattan distance technique outperformed the other techniques when the threshold values increased.

Keywords: data mining, clustering, outlier

I. INTRODUCTION

Data mining is a process of extracting valid, previously unknown, and ultimately comprehensible information from large datasets and using it for organizational decision making [1]. However, many problems exist in mining large datasets, such as data redundancy, attribute values that are not specific, incomplete data and outliers [2].

An outlier is defined as a data point which is very different from the rest of the data based on some measure. Such a point often contains useful information on abnormal behavior of the system described by the data [3]. On the other hand, many data mining algorithms in the literature find outliers as a side-product of clustering algorithms.
From the viewpoint of a clustering algorithm, outliers are objects not located in the clusters of a dataset, usually called noise [2].

Outlier detection is one of the most interesting problems to arise recently in data mining research, and a few studies have been conducted on outlier detection for large datasets [3]. Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. However, this could result in the loss of important hidden information, since one person's noise could be another person's signal [4]. In other words, the outliers themselves may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity [5].

Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. It finds applications in credit card fraud, network robustness analysis, network intrusion detection, financial applications and marketing [3]. Thus, outlier detection and analysis is an interesting and important data mining task.

This paper discusses the control chart, linear regression and Manhattan distance techniques for outlier data detection from a data mining perspective. The main idea is to compare these techniques to determine which is better, based on the number of outlier data detected and on the threshold values. Outlier detection analysis can involve many types of data, such as binary, nominal and ordinal variables. However, in this analysis only numerical data are considered.

The rest of this paper is organized as follows. Section 2 discusses related work on outlier data detection techniques. The framework and formulas (equations) for the control chart, linear regression and Manhattan distance techniques are presented in Section 3, and an extensive performance evaluation is reported in Section 4. Section 5 concludes with a summary of these outlier data detection techniques.

II. RELATED WORK

Recently, a few studies have been conducted on outlier data detection for large datasets.
Distribution-based methods were previously developed by the statistics community. In these techniques, the data points are modeled using a stochastic distribution, and points are determined to be outliers depending on their relationship with this model. However, with increasing dimensionality, it becomes increasingly difficult and inaccurate to estimate the multidimensional distributions of the data points [6].

The distance-based method was originally proposed by Knorr and Ng [4]. Ramaswamy et al. [6] extended the distance-based outlier detection algorithm: the top n points with the maximum $D^k$ are considered outliers, where $D^k(p)$ denotes the distance of the k-th nearest neighbor of p. They used a cluster algorithm to partition a dataset into several groups. Pruning and batch processing on these groups could improve efficiency for outlier detection [7].

1-4244-0023-6/06/$20.00 2006 IEEE CIS 2006

On the other hand, deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that deviate from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers [5].

The density-based method was proposed by Breunig et al. [2]. It relies on the local outlier factor (LOF) of each point, which depends on the local density of its neighborhood. Clustering-based outlier detection techniques regard small clusters as outliers [8] or identify outliers by removing clusters from the original dataset [1].

Meanwhile, Dantong Yu et al. [1] proposed a new method which applies signal-processing techniques to solve important problems in data mining. They introduced a novel deviation (or outlier) detection approach, termed FindOut, based on the wavelet transform. The main idea in FindOut is to remove the clusters from the original data and then identify the outliers. Although previous research showed that such techniques may not be effective because of the nature of the clustering, FindOut can successfully identify outliers from large datasets. Experimental results showed that the proposed approach is efficient and effective on very large datasets [1].

In addition, Aggarwal and Yu [3] introduced a new technique for outlier detection which is especially suited to very high dimensional data sets. The method works by finding lower dimensional projections which are locally sparse and cannot be discovered easily by brute force techniques because of the number of combinations of possibilities. This technique has advantages over simple distance-based outliers, which cannot overcome the effects of the dimensionality curse.
They illustrated how to implement the technique effectively for high dimensional applications by using an evolutionary search technique. This implementation works almost as well as a brute-force implementation over the search space in terms of finding projections with very negative sparsity coefficients, but at a much lower cost. The techniques discussed in that paper extend the applicability of outlier detection techniques to high dimensional problems; such cases are most valuable from the perspective of data mining applications [3].

Williams et al. [9] proposed replicator neural networks (RNNs) for outlier detection. They compared RNN for outlier detection with three other methods using both publicly available statistical datasets (generally small) and data mining datasets (generally much larger and generally real data). The RNN method performed satisfactorily for both small and large datasets. It was of interest that it performed well on the small datasets, since neural network methods often have difficulty with such smaller datasets. Its performance appears to degrade with datasets containing radial outliers, so it is not recommended for this type of dataset. RNN performed the best overall on the KDD intrusion dataset [9].

Thus, from the studies discussed above, we found that research in outlier detection can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce exceptions, bankruptcy and credit card fraud. Such knowledge can lead to new directions for future investment, marketing and other purposes.

III. RESEARCH METHODOLOGY

Outlier detection approaches can be categorized into three groups: the statistical approach, the distance-based approach and the deviation-based approach. In this outlier analysis, we examine the statistical approach because it is appropriate for one-dimensional samples, and this analysis is based on one-dimensional data. The control chart and linear regression techniques were applied for the statistical approach.
Besides that, we also examine the distance-based approach in order to counter the main limitations imposed by the statistical approach [9]. The Manhattan distance technique was applied for the distance-based approach.

A. Statistical Approach

The statistical approach to outlier detection assumes a distribution or probability model for the given data set and then identifies outliers with respect to the model using a discordancy test [5]. In particular, the analysis for the statistical approach is based on five phases:

1) Data collection: This analysis is based on our observation of air pollution data taken in Kuala Lumpur in August 2002. A set of air pollution data items consists of five major components that can cause air pollution, i.e. {Carbon Monoxide (CO), Ozone (O3), Particulate Matter (PM), Nitrogen Dioxide (NO2) and Sulfur Dioxide (SO2)}. The value of each item is in parts per million (ppm), except PM, which is in micrograms (µgm). The data were recorded every hour of every day. We present the actual data as the average amount of each data item per day.

2) Compute average value/compute linear regression equation: At this phase, the average value was computed in order to obtain the centre line for the control chart technique. Likewise, the linear regression equation was calculated to determine the linear regression line.

3) Compute upper and lower control limits/compute upper and lower bound values: The upper control limit (UCL) and lower control limit (LCL) for the control chart technique are based on equations (2) to (5) in Section B, while the upper and lower bounds for the linear regression technique are based on 95 percent from the linear regression equation (line).

4) Data testing: At this phase, the actual data, centre line, UCL and LCL are plotted on the control chart, while the actual data, linear regression line, and upper and lower bounds are plotted on the linear regression graph. Outlier data can be identified from these graphs: data plotted outside the upper and lower control limits/bounds are detected as outlier data.
5) Analysis and comparison of the output: The output from data testing is used to compare and analyse the techniques. The purpose of these activities is to find the best technique for detecting outlier data based on the statistical approach.

B. Control Chart Technique (CCT)

In this section, we study the control chart technique for outlier data detection. Usually, CCT is used to determine whether a process is operating in statistical control. The purpose of a control chart is to detect any unwanted changes in the process. These changes are signaled by abnormal (outlier) points on the graph [10]. Basically, a control chart consists of three basic components:

1) a centre line, usually the mathematical average of all the samples plotted.
2) upper and lower control limits that define the constraints of common cause variations.
3) performance data plotted over time.

Firstly, calculate the average of the data points to get the centre line of the control chart. The formula is

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (1)$$

where
$\bar{X}$ = mean/average value
$X_i$ = every data value ($X_1, \ldots, X_n$)
$n$ = total number of data

Secondly, calculate the upper control limit (UCL) and lower control limit (LCL) using equations (2) to (5). In a 3-sigma system, Z is equal to 3. The reason that 3-sigma control limits balance the risk of error is that, for normally distributed data, data points will fall inside the 3-sigma limits 99.7% of the time when a process is in control. This makes the witch hunts infrequent but still makes it likely that unusual causes of variation will be detected [10].

Finally, the data are plotted on the chart, and data that fall outside the UCL and LCL are detected as outlier data. Figure 1 shows an example of a control chart with one data point outside the UCL; this point is an outlier.

Figure 1. An example of control chart

C. Linear Regression Technique (LRT)

Many statistical concepts form the basis of data mining techniques, such as point estimation, Bayes theorem and regression. For this outlier detection analysis, LRT is used because it is appropriate for evaluating the strength of a relationship between two variables. In general, regression is the problem of estimating a conditional expected value.
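For the control chart technique of Section B, the computation of the centre line and control limits can be sketched in Python. This is a minimal sketch, not the authors' Matlab implementation: it assumes the limits are the mean plus or minus Z sample standard deviations, which is one common convention and may differ from the paper's equations (2) to (5). The function names `control_limits` and `control_chart_outliers`, and the sample readings, are hypothetical.

```python
from statistics import mean, stdev


def control_limits(values, z=3.0):
    """Return (centre_line, lcl, ucl) for a sequence of observations.

    Assumption: limits are centre +/- z * sample standard deviation.
    """
    centre = mean(values)   # equation (1): arithmetic mean of all data
    sigma = stdev(values)   # assumed spread estimate (sample std. dev.)
    return centre, centre - z * sigma, centre + z * sigma


def control_chart_outliers(values, z=3.0):
    """Return (index, value) pairs that fall outside the control limits."""
    _, lcl, ucl = control_limits(values, z)
    return [(i, x) for i, x in enumerate(values) if x < lcl or x > ucl]


# Illustrative data only (not from the paper): 19 readings of 10, one of 100.
readings = [10] * 19 + [100]
flagged = control_chart_outliers(readings)
```

With Z = 3 only the last reading is flagged in this toy series. A smaller Z narrows the band and would flag more points.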
The term linear refers to the assumption of a linear relationship between y (response variable) and x (predictor variable). Thus, in statistics, linear regression is a method of estimating the linear relationship between the input data and the output data [11]. The common formula for a linear relationship used in this model is [5]

$$Y = \alpha + \beta x \qquad (6)$$

where the variance of Y is assumed to be constant, and α and β are regression coefficients specifying the Y-intercept and slope of the line, respectively. Given s samples or data points of the form $(x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s)$, α and β can be estimated with the following equations:

$$\beta = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2} \qquad (7)$$

$$\alpha = \bar{y} - \beta \bar{x} \qquad (8)$$

where $\bar{x}$ is the average of $x_1, x_2, \ldots, x_s$, and $\bar{y}$ is the average of $y_1, y_2, \ldots, y_s$. The coefficients α and β often provide good approximations to otherwise complicated regression equations.

D. Distance-based Approach

One drawback of the statistical approach is that it requires knowledge about parameters of the data set, such as the data distribution. However, in many cases the data distribution may not be known [5]. Therefore, the distance-based approach was introduced to overcome the problems that arise from the statistical approach. The criterion for outlier detection using this approach is based on two parameters, parameter (p) and distance (d), which may be given in advance using knowledge about the data, or which may be changed during the iterations to select the most representative outliers. In particular, the analysis for the distance-based approach is based on the nine phases below.

1) Data collection: As discussed in Section A.
2) Compute the distances of each data (d1): The distance between data was computed to yield the distances of each data.
3) Identify maximum distance value of data (d2): The maximum distance value was identified to determine a range for the threshold distance value (d3).
4) Determine threshold distance value (d3): This value was determined based on the maximum distance value (d2). The threshold distance value (d3) should be smaller than the maximum distance value (d2). Otherwise, the comparison process could not be done.
5) Compare d3 and d1 (p): At this phase, the parameter value (p) is determined by comparing d3 and d1, where p is equal to d1 >= d3.
6) Determine threshold value (t): The threshold value (t) has to be assigned to indicate the research space.
7) Compare t and p: At this phase, the threshold value is compared with the result of phase five.
8) Data testing: At this phase, outlier data can be identified.
9) Analysis and comparison of the output: The output from data testing is used to compare and analyse the techniques.

E. Manhattan Distance Technique (MDT)

Commonly, distances can be based on a single dimension or on multiple dimensions. It is up to the researcher to select the right method for his/her specific application. For this outlier detection analysis, MDT is used because the data are single dimension. The general formula for MDT is

$$d(t_i, t_j) = \sum_{h=1}^{k} |t_{ih} - t_{jh}| \qquad (9)$$

where $t_i = \langle t_{i1}, \ldots, t_{ik} \rangle$ and $t_j = \langle t_{j1}, \ldots, t_{jk} \rangle$ are tuples in a database.

IV. PERFORMANCE EVALUATION

In this section, we first compare the efficiency of the linear regression and control chart techniques (statistical approach). Both algorithms are implemented using Matlab 6.5, with Microsoft Access as the database.
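The Matlab code itself is not reproduced in the paper. As a rough illustration only, the coefficient estimates of equations (7) and (8) could be written in Python as below. The outlier band is an assumption: the paper does not say how its 95 percent bounds were computed, so this sketch uses plus or minus 1.96 residual standard deviations around the fitted line. The names `fit_line` and `regression_outliers`, and the sample data, are hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Estimate (alpha, beta) of y = alpha + beta * x by equations (7), (8)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx               # equation (7): slope
    alpha = y_bar - beta * x_bar   # equation (8): Y-intercept
    return alpha, beta


def regression_outliers(xs, ys, k=1.96):
    """Return indices whose residual lies outside an assumed band.

    Assumption: the band is +/- k residual standard deviations, with
    k = 1.96 as a rough 95 percent band. Needs more than two points.
    """
    alpha, beta = fit_line(xs, ys)
    residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    s = sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > k * s]


# Illustrative data only: points on y = 1 + 2x, with one value shifted by 10.
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
ys[5] += 10
outlier_indices = regression_outliers(xs, ys)
```

In this toy series only the shifted point at index 5 falls outside the band. A different definition of the bounds would change which points are flagged, so the counts in the paper should not be expected to follow from this sketch.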
Through the performance evaluation, we show that the control chart technique is better than the linear regression technique because it detects a smaller number of outlier data. As mentioned in Section III, this outlier analysis is based on air pollution data. An example of the air pollution data is shown in Table I.

TABLE I. AIR POLLUTION DATA

Date      CO     O3      PM    NO2     SO2
1/8/02    2.26   0.0     74    0.005   0.041
2/8/02    2.46   0.120   68    0.004   0.037
...       ...    ...     ...   ...     ...
30/8/02   2.05   0.012   60    0.006   0.029

For both techniques, a data point was identified as an outlier if it fell outside the control limits or boundaries. In the control chart technique, UCL and LCL were determined based on the formulas (equations) discussed in Section B, while the upper and lower boundaries in the linear regression technique are based on a 95 percent computation from the linear regression equation that has been identified.

TABLE II. THE RESULT FOR CCT AND LRT

Data   Outlier data for CCT   Outlier data for LRT
CO     16                     25
O3     18                     30
PM     20                     25
SO2    21                     29
NO2    16                     21

Figure 2. Graph for outlier data detection using CCT and LRT (number of outlier data per air pollution component, control chart technique versus linear regression technique)

As illustrated in Table II and Figure 2, fewer outlier data were detected by the control chart technique than by the linear regression technique. In this study, the lower the number of outlier data detected, the better the technique is taken to be. This is because the data plotted on the control chart converge more closely on the data average line. Thus, more useful data remain available for analysis, which could in turn yield a more accurate result.

Secondly, we analyse the MDT (distance-based approach). This algorithm is also implemented using Matlab 6.5, with Microsoft Access as the database. In the Manhattan distance technique, the threshold values (tv) have to be assigned. Besides that, the outlier data also depend on the threshold distance values (d3).
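One possible reading of the distance-based phases in Section D, using the Manhattan distance of equation (9), is sketched below in Python. It is an interpretation rather than the authors' Matlab code: it assumes that, for each point, p counts the other points lying at least d3 away, and that a point is flagged when p reaches the threshold value tv. The function names and the sample points are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two equal-length tuples, equation (9)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_outliers(points, d3, tv):
    """Return indices of points flagged under one reading of Section D.

    Assumption: for each point, p counts the other points whose distance
    d1 is at least the threshold distance d3; it is flagged when p >= tv.
    """
    # Phase 3: the maximum pairwise distance d2 bounds the usable d3.
    d2 = max(manhattan(a, b) for a in points for b in points)
    if not d3 < d2:
        raise ValueError("d3 must be smaller than the maximum distance d2")
    flagged = []
    for i, a in enumerate(points):
        # Phases 2 and 5: distances d1 from this point, compared with d3.
        p = sum(1 for j, b in enumerate(points)
                if j != i and manhattan(a, b) >= d3)
        # Phases 7 and 8: compare p with the threshold value tv.
        if p >= tv:
            flagged.append(i)
    return flagged


# Illustrative one-dimensional data only, stored as 1-tuples.
points = [(1.0,), (1.1,), (1.2,), (5.0,)]
flagged = distance_outliers(points, d3=1.0, tv=2)
```

Under this reading, raising tv makes the flagging condition stricter, so the set of flagged points can only shrink as tv grows.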
The d3 value has to be smaller than the maximum distance value (d2) that exists between the data. This ensures that d3 is not out of range and that the comparison process can be done. We obtain the parameter value (p) by comparing d3 with the distances of each data (d1). Further, we compare t with p to obtain the outlier data. From equation (9), we obtained d2, d3, tv and the number of outliers as shown in Table III.

TABLE III. THE RESULT FOR MDT

Data   Max. distance value (d2)   Threshold distance value (d3)   Threshold value (tv)   Number of outliers
CO     1.82                       1.0                             2                      15
                                                                  4                      13
                                                                  6                      9
O3     0.08                       0.01                            2                      27
                                                                  5                      17
                                                                  7                      11
PM     81                         50                              2                      7
                                                                  3                      5
                                                                  4                      2
SO2    0.07                       0.003                           4                      21
                                                                  5                      12
                                                                  6                      11
NO2    0.028                      0.0                             6                      12
                                                                  7                      7
                                                                  8                      5

Figure 3. Graph for outlier data detection using MDT (number of outliers against threshold value for CO, O3, PM, SO2 and NO2)

Table III and Figure 3 show that when the threshold value increases, the number of outlier data detected decreases. This implies that the number of outliers is inversely related to the threshold value. This is because the space of the useful data in the cluster becomes bigger.

TABLE IV. THE COMPARISON RESULT FOR THREE TECHNIQUES

Data   Outlier data for MDT   Outlier data for CCT   Outlier data for LRT
CO     9                      16                     25
O3     11                     18                     30
PM     2                      20                     25
SO2    11                     21                     29
NO2    5                      16                     21

Figure 4. Graph for outlier data detection using three techniques (number of outlier data per air pollution component for Manhattan distance, control chart and linear regression)

As illustrated in Table IV and Figure 4, fewer outlier data were detected by the Manhattan distance technique than by the control chart and linear regression techniques. Since a lower number of detected outlier data is taken to indicate a better technique, this implies that the distance-based approach is more practical and reliable than the statistical approach in outlier data detection.

V. CONCLUSION

This paper presented the results of an experimental study of some common outlier detection techniques. Firstly, we compared the two outlier detection techniques in the statistical approach, the linear regression and control chart techniques. The experimental results indicate that the control chart technique is better than the linear regression technique for outlier data detection. Next, we analysed the Manhattan distance technique, which is based on the distance-based approach.
The experimental studies show that the Manhattan distance technique outperformed the other techniques (distance-based and statistical-based approaches) when the threshold values increased.

REFERENCES

[1] Yu, D., Sheikholeslami, G. and Zhang, A., FindOut: finding outliers in very large datasets, In Knowledge and Information Systems, 2002, pp. 387-412.
[2] Breunig, M.M., Kriegel, H.P., and Ng, R.T., LOF: Identifying density-based local outliers, ACM Conference Proceedings, 2000, pp. 93-104.
[3] Aggarwal, C. C., Yu, P. S., An effective and efficient algorithm for high-dimensional outlier detection, The VLDB Journal, 2005, vol. 14, pp. 211-221.
[4] Knorr, E.M., Ng, R. T., Tucakov, V., Distance-based outliers: algorithms and applications, The VLDB Journal, 2000, vol. 8, pp. 237-253.
[5] Han, J. and Kamber, M., Data Mining: Concepts and Techniques, USA: Morgan Kaufmann, 2001.
[6] S. Ramaswamy, R. Rastogi, and S. Kyuseok, Efficient algorithms for mining outliers from large data sets, In Proc. of the ACM SIGMOD International Conference on Management of Data, 2000, pp. 93-4.
[7] Aggarwal, C. C., Yu, P. S., Outlier detection for high dimensional data, SIGMOD '01, 2001, pp. 37-46.
[8] M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outlier detection, Pattern Recognition Letters, 2001, vol. 22(6-7), pp. 691-700.
[9] G. Williams, R. Baxter, H. He, S. Hawkins, L. Gu, A comparative study of RNN for outlier detection in data mining, Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM02), Maebashi City, Japan, 2002, pp. 709-712.
[10] SkyMark: Control Chart, at http://www.skymark.com/resources/tools/control_charts.asp (accessed: 13 December 2005)
[11] Wikipedia: Linear Regression, at http://en.wikipedia.org/wiki/linear_regression (accessed: 13 December 2005)