Incremental Local Outlier Detection for Data Streams


Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007)

Incremental Local Outlier Detection for Data Streams

Dragoljub Pokrajac, CIS Dept. and AMRC, Delaware State University, Dover, DE; Aleksandar Lazarevic, United Technologies Research Center, East Hartford, CT, USA; Longin Jan Latecki, CIS Department, Temple University, Philadelphia, PA

Abstract. Outlier detection has recently become an important problem in many industrial and financial applications. The problem is further complicated by the fact that in many cases outliers have to be detected from data streams that arrive at an enormous pace. In this paper, an incremental LOF (Local Outlier Factor) algorithm, appropriate for detecting outliers in data streams, is proposed. The proposed incremental LOF algorithm provides detection performance equivalent to the iterated static LOF algorithm (applied after insertion of each data record), while requiring significantly less computational time. In addition, the incremental LOF algorithm dynamically updates the profiles of data points. This is a very important property, since data profiles may change over time. The paper provides theoretical evidence that insertion of a new data point, as well as deletion of an old data point, influences only a limited number of their closest neighbors, so that the number of updates per insertion/deletion does not depend on the total number of points N in the data set. Our experiments performed on several simulated and real-life data sets demonstrate that the proposed incremental LOF algorithm is computationally efficient, while at the same time very successful in detecting outliers and changes of distributional behavior in various data stream applications.

I. INTRODUCTION

Despite the enormous amount of data being collected in many scientific and commercial applications, particular events of interest are still quite rare. These rare events, very often called outliers or anomalies, are defined as events that occur very infrequently (their frequency ranges from a few percent of the data down to far less than one percent, depending on the application). Detection of outliers (rare events) has recently gained a lot of attention in many domains, ranging from video surveillance and intrusion detection to fraudulent transactions and direct marketing. For example, in video surveillance applications, video trajectories that represent suspicious and/or unlawful activities (e.g., identification of traffic violators on the road, detection of suspicious activities in the vicinity of objects) represent only a small portion of all video trajectories. Similarly, in the network intrusion detection domain, the number of cyber attacks on the network is typically a very small fraction of the total network traffic. Although outliers (rare events) are by definition infrequent, in each of these examples their importance is quite high compared to other events, making their detection extremely important. Data mining techniques developed for this problem are based on both supervised and unsupervised learning. Supervised learning methods typically build a prediction model for rare events based on labeled data (the training set) and use it to classify each event [1, 2]. The major drawbacks of supervised data mining techniques include: (1) the necessity of having labeled data, which can be extremely time consuming to obtain for real-life applications, and (2) the inability to detect new types of rare events.
In contrast, unsupervised learning methods typically do not require labeled data and detect outliers as data points that are very different from the normal (majority) data according to some measure. These methods are typically called outlier/anomaly detection techniques, and their success depends on the choice of similarity measures, feature selection and weighting, etc. They have the advantage of detecting new types of rare events as deviations from normal behavior, but on the other hand they suffer from a possibly high rate of false positives, primarily because previously unseen (yet normal) data can also be recognized as outliers/anomalies. Very often, data in rare-event applications (e.g., network traffic monitoring, video surveillance, web usage logs) arrive continuously at an enormous pace, posing a significant challenge for timely analysis. In such cases, it is important to make decisions quickly and accurately. If there is a sudden or unexpected change in the existing behavior, it is essential to detect this change as soon as possible. Assume, for example, that a computer in the local area network uses only a limited number of services (e.g., Web traffic, telnet, ftp) through corresponding ports. All these services correspond to certain types of behavior in the network traffic data. If the computer suddenly starts to utilize a new service (e.g., ssh), this will certainly look like a new type of behavior in the network traffic data. Hence, it is desirable to detect such behavior as soon as it appears, especially since it may very often correspond to illegal or intrusive events. Even when this specific change in behavior is not necessarily intrusive or suspicious, it is very important for a security analyst to understand the network traffic and to update the notion of normal behavior. Further, on-line detection of unusual behavior and events also plays a significant role in video and image analysis. Automated identification of suspicious behavior and objects (e.g., people crossing the perimeter around protected areas, leaving unattended luggage at airport installations, cars driving unusually slow or unusually fast or with unusual trajectories) based on information extracted from video streams is currently an active research area. Other potential applications include traffic control and surveillance of commercial and residential buildings. These tasks are characterized by the need for real-time processing (such that any suspicious activity can be identified prior to doing harm to people, facilities and installations) and by dynamic, non-stationary and often noisy environments. Hence, there is a necessity for incremental outlier detection that can adapt to novel behavior and provide timely identification of unusual events.

Recently, the LOF (Local Outlier Factor) algorithm [9] has been successfully applied in many domains for outlier detection in a batch mode. In this paper, we propose a novel incremental LOF algorithm that is appropriate for detecting outliers in data streams. To the best of our knowledge, the proposed incremental LOF algorithm is the first incremental outlier detection algorithm. It provides detection performance equivalent to the static LOF algorithm and has O(N log N) time complexity, where N is the total number of data points. The paper shows that insertion of new data points, as well as deletion of obsolete points, influences only a limited number of their nearest neighbors, so that the insertion/deletion time complexity per data point does not depend on the total number of points N. Our experiments performed on several simulated and real-life data sets demonstrate that the proposed incremental LOF algorithm can be very successful in detecting outliers in various data streaming applications.

II. BACKGROUND

Outlier detection techniques can be categorized into four groups: (1) statistical approaches; (2) distance-based methods; (3) profiling methods; and (4) model-based approaches. In statistical techniques, the data points are typically modeled using a stochastic distribution, and points are labeled as outliers depending on their relationship with this model. Distance-based approaches detect outliers by computing distances among points. Several recently proposed distance-based outlier detection algorithms are based on (1) computing the full-dimensional distances among points using all the available features, or only feature projections, and (2) computing the densities of local neighborhoods [9]. In addition, clustering-based techniques have also been used to detect outliers, either as side products of the clustering algorithms (points that do not belong to clusters) or as clusters that are significantly smaller than others. In profiling methods, profiles of normal behavior are built using different data mining techniques or heuristic-based approaches, and deviations from them are considered outliers (e.g., network intrusions). Finally, model-based approaches usually first characterize the normal behavior using some predictive model (e.g., replicator neural networks or unsupervised support vector machines) and then detect outliers as deviations from the learned model.

Initially proposed outlier detection algorithms determine outliers once all the data records (samples) are present in the dataset. We refer to these algorithms as static outlier detection algorithms. In contrast, incremental outlier detection techniques identify outliers as soon as a new data record appears in the dataset. Incremental outlier detection has also been used within the more general framework of activity monitoring. In addition, Domingos and Hulten proposed broad requirements that incremental algorithms need to meet, while Yamanishi and Takeuchi used on-line discounting distributional learning of a Gaussian mixture model and scoring based on the estimated probability density function. In this study, we propose an incremental outlier detection algorithm based on computing the densities of local neighborhoods. In our previous work, we experimented with numerous outlier detection algorithms for network intrusion identification, and we concluded that local density based outlier detection (e.g., LOF) typically achieved the best prediction performance.
The main idea of the LOF algorithm [9] is to assign to each data record a degree of being an outlier. This degree is called the local outlier factor (LOF) of a data record. Data records (points) with high LOF have local densities smaller than their neighborhood and typically represent stronger outliers, unlike data points belonging to uniform clusters, which usually tend to have lower LOF values. The algorithm for computing the LOFs of all data records has the following steps:

1. For each data record q, compute k-distance(q) as the distance to the k-th nearest neighbor of q (for definitions, see Section III).

2. Compute the reachability distance of each data record q with respect to data record p as

$\mathrm{reach\text{-}dist}_k(q,p) = \max\big(d(q,p),\ k\text{-}\mathrm{distance}(p)\big),$ (1)

where d(q,p) is the Euclidean distance from q to p.

3. Compute the local reachability density (lrd) of each data record q as the inverse of the average reachability distance over the k nearest neighbors of q (in the original LOF publication [9], the parameter k was named MinPts):

$\mathrm{lrd}(q) = \left( \frac{1}{k} \sum_{p \in kNN(q)} \mathrm{reach\text{-}dist}_k(q,p) \right)^{-1}.$ (2)

4. Compute the LOF of each data record q as the ratio of the average local reachability density of q's k nearest neighbors and the local reachability density of q:

$\mathrm{LOF}(q) = \frac{\frac{1}{k}\sum_{p \in kNN(q)} \mathrm{lrd}(p)}{\mathrm{lrd}(q)}.$ (3)

The main advantages of the LOF approach over other outlier detection techniques include:
- It detects outliers with respect to the density of their neighboring data records, not with respect to a global model.
- It is able to detect outliers regardless of the data distribution of normal behavior, since it does not make any assumptions about the distribution of data records.
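As a concrete illustration of Eqs. (1)-(3), the following minimal Python sketch computes static LOF values by brute force. It is our reconstruction for illustration only, not the authors' implementation, and the helper names (knn_info, lof) are ours:

import numpy as np

def knn_info(X, k):
    """Brute-force neighbor search: pairwise distances, the indices of the
    k nearest neighbors of each point, and each point's k-distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)            # a point is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]      # k nearest neighbors per point
    k_dist = D[np.arange(len(X))[:, None], nn][:, -1]
    return D, nn, k_dist

def lof(X, k):
    D, nn, k_dist = knn_info(X, k)
    n = len(X)
    lrd = np.empty(n)
    for q in range(n):
        # Eq. (1): reach-dist_k(q, p) = max(d(q, p), k-distance(p))
        reach = np.maximum(D[q, nn[q]], k_dist[nn[q]])
        lrd[q] = 1.0 / reach.mean()        # Eq. (2)
    # Eq. (3): mean lrd of q's neighbors divided by lrd(q)
    return np.array([lrd[nn[q]].mean() / lrd[q] for q in range(n)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])   # one clear outlier
print(lof(X, k=5)[-1])                     # LOF of the outlier, well above 1

Points deep inside a uniform cluster obtain LOF values close to 1, while the isolated point receives a much larger value, matching the interpretation given above.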

In order to fully justify the need for incremental outlier detection techniques, it is important to understand that applying static LOF outlier detection algorithms to data streams would be extremely computationally inefficient and may often lead to incorrect prediction results. Namely, the static LOF algorithm may be applied to data streams in three ways:

1. Periodic LOF. Apply the LOF algorithm on the entire data set periodically (e.g., after every block of data records is inserted, similar to the strategy discussed in [9]), or after all the data records are inserted. The major problem of this approach is its inability to detect outliers that mark the beginning of a new behavior and initially appear within the inserted block. Fig. 1 illustrates this scenario. Assume that a new data point d_n (red asterisk in Fig. 1a) is added to the original data distribution (blue dots in Fig. 1a). Initially, the point d_n is an outlier, since it is distinct from all other data records. However, when additional data records (red asterisks in Fig. 1b) start to group around the initial data record d_n, these new points are no longer outliers, since they form their own cluster. Ideally, at the time of its insertion, the data record d_n should be identified as an outlier. In this periodic scenario, however, the LOF algorithm is applied only after all data records have been added to the original data set and have thus already formed the new distribution. Hence, all the points from the new cluster, including d_n, will be identified as normal behavior! On the other hand, if an incremental LOF algorithm is applied after every new data instance is inserted into the data set, it is possible not only to detect points like d_n as outliers but also to detect the moment in time when this change of behavior occurred.

Fig. 1. Inability of static LOF algorithms to identify a change of behavior: a) The point d_n can be correctly identified as an outlier by the supervised LOF algorithm but not by the periodic LOF; b) Points belonging to the new distribution are incorrectly identified as outliers by the supervised LOF.

2. Supervised LOF. Given a training data set D at time interval t0, the LOF algorithm is first applied to compute the k-distance, lrd and LOF values of all data records in the training data set D. For every time interval t > t0 at which a new data record d_n is inserted into the data set, the k-distance, the reachability distances and the lrd value are computed for the new record d_n. However, when computing LOF(d_n) using Eq. (3), lrd(d_n) is used along with the pre-computed lrd values of the original data set D. It is apparent that this approach results in several problems: (i) the estimated value LOF(d_n) is not accurate, since it uses pre-computed k-distance, reach-dist and lrd values; (ii) a new behavior (shown as red asterisks in Fig. 1b) is always detected as an outlier, since this approach does not update the normal behavior profile; (iii) masquerading (an attempt to hide within an existing distribution) cannot be identified, since all inserted data points are always considered normal as long as they belong to the normal distribution (Fig. 2). Namely, assume that a new data point d_n (red square in Fig. 2a) is inserted within the existing data distribution and further new data points start to group around the point d_n (red squares in Fig. 2b), but with much higher density than the original data distribution. These newly added points form a cluster of very high density which is substantially different from the original distribution. The supervised LOF approach considers these points to belong to the original data distribution, since it is not aware of the new data points forming the dense cluster. On the other hand, the incremental LOF algorithm, applied after the insertion of each new data point, identifies this phenomenon, since it takes the newly added points into account when updating the lrd and LOF values of the existing points (those already in the database).

Fig. 2. Detecting masqueraders (hiding within an existing distribution).

3. Iterated LOF. Re-apply the static LOF algorithm every time a new data record d_n is inserted into the data set. This approach does not suffer from the aforementioned problems, but it is extremely computationally expensive, since every time a new point is inserted the algorithm recomputes the LOF values of all the data points in the data set.
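Both of the above strategies can be written as short driver loops. As a rough illustration (ours, not the paper's code), the iterated strategy amounts to rerunning a static LOF implementation, such as the lof function sketched earlier, after every single insertion:

import numpy as np

def iterated_lof_scores(stream, k):
    """Iterated LOF baseline: rerun static LOF after every insertion and
    record the LOF value of the newly inserted point at insertion time."""
    data, scores = [], []
    for x in stream:
        data.append(x)
        if len(data) > k:                  # each point needs k neighbors
            scores.append(lof(np.array(data), k)[-1])
    return scores

Every pass recomputes values for points whose neighborhoods did not change at all; the incremental algorithm described in Section III instead touches only the affected neighbors.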
Knowing that the time complexity of the LOF algorithm is O(n log n) [9], where n is the current number of data records in the data set, the total time complexity of this iterated approach after insertion of N points is

$O\left(\sum_{n=1}^{N} n \log n\right) = O(N^2 \log N).$ (4)

Our proposed incremental LOF algorithm is designed to provably provide the same prediction performance (detection rate and false alarm rate) as the iterated LOF. This is achieved by consistently maintaining, for all existing points in the database, the same LOF values as the iterated LOF algorithm. The proposed incremental LOF algorithm efficiently addresses the problems illustrated in Figs. 1 and 2, and it has time complexity O(N log N), thus clearly outperforming the static iterated LOF approach. After all N data records are inserted into the data set, the final result of the incremental LOF algorithm on the N data points is independent of the order of insertion and is equivalent to that of the periodic LOF executed after all the data records are inserted.

III. METHODOLOGY

When designing the incremental LOF algorithm, we were motivated by two goals. First, the result of the incremental algorithm must be equivalent to the result of the static algorithm every time t a new point is inserted into the data set; thus, there is no difference between applying the incremental LOF and the periodic static LOF when all data records up to the time instant t are available. Second, the asymptotic time complexity of the incremental LOF algorithm has to be comparable to that of the static LOF algorithm. In order to have a feasible incremental algorithm, it is essential that, at any time moment t, insertion or deletion of a data record results in a limited (preferably small) number of updates of the algorithm parameters. Specifically, the number of updates per insertion/deletion must not depend on the current number of records in the dataset; otherwise, the performance of the incremental LOF algorithm would be Ω(N^2), where N is the final size of the dataset. In this section, we demonstrate efficient insertion and deletion of records in the incremental LOF algorithm and provide its exact time complexity analysis.

A. Incremental LOF algorithm

The proposed incremental LOF algorithm computes the LOF value of each data record inserted into the data set and instantly determines whether the inserted record is an outlier. In addition, the LOF values of the existing data records are updated if needed.

i. Insertion. In the insertion part, the algorithm performs two steps: a) insertion of the new record, when it computes the reach-dist, lrd and LOF values of the new point; b) maintenance, when it updates the k-distance, reach-dist, lrd and LOF values of the affected existing points. Let us first illustrate these steps through the example of inserting a new data point n into the data set shown in Fig. 3a. If we assume k = 2, we first need to compute the reachability distances to the two nearest neighbors of the new data point n, so that its lrd value can be computed. As shown further in the paper (Theorem 1), insertion of the point n may decrease the k-distance of certain neighboring points; this can happen only for points that have the new point n in their k-neighborhood. Hence, we need to determine all such affected points (see Fig. 3a). According to Eq. (1), when k-distance(p) changes for a point p, reach-dist(q,p) is affected only for points q that are in the k-neighborhood of the point p (Fig. 3b). According to Eq. (2), the lrd value of a point q is affected if: a) the k-neighborhood of the point q changes, or b) the reach-dist from the point q to one of its k-neighbors changes. The k-neighborhood of a point changes only if the new point n becomes one of its k-neighbors. Hence, we need to update lrd on all points that now have the point n as one of their k-neighbors, and on all points q for which reach-dist(q,p) is updated and p is among the k nearest neighbors of q (Fig. 3c). According to Eq. (3), the LOF value of an existing point q should be updated if lrd(q) is updated or if lrd(p) changes for one of its k-neighbors p (Fig. 3d). Note that the LOF value of a point is not updated if none of its k nearest neighbors have an updated lrd value.

The general framework of the incremental LOF method is shown in Fig. 4. As in the static LOF algorithm [9], we define the k-th nearest neighbor of a record p as a record q from the dataset S such that for at least k records o ∈ S \ {p} it holds that d(p,o) ≤ d(p,q), and for at most k-1 records o ∈ S \ {p} it holds that d(p,o) < d(p,q), where d(p,q) denotes the Euclidean distance between the data records p and q. We refer to d(p,q) as k-distance(p). The k nearest neighbors of p (referred to as kNN(p)) include all points r ∈ S \ {p} such that d(p,r) ≤ d(p,q). We also define the k reverse nearest neighbors of p (referred to as kRNN(p)) as all points q for which p is among their k nearest neighbors. For a given data record p, kNN(p) and kRNN(p) can be retrieved by executing, respectively, nearest-neighbor and reverse (a.k.a. inverse) nearest-neighbor queries on the dataset S.

The correctness of the insertion algorithm is based on the following Theorems 1-4.

Theorem 1. The insertion of a point p_c affects the k-distance only at points p_j that have the point p_c in their k-nearest neighborhood, i.e., at points p_j ∈ kRNN(p_c). The new k-distances of the affected points p_j are updated as follows:

$k\text{-}\mathrm{distance}^{(new)}(p_j) = \begin{cases} d(p_j, p_c), & \text{if } p_c \text{ is the new } k\text{-th nearest neighbor of } p_j, \\ (k-1)\text{-}\mathrm{distance}^{(old)}(p_j), & \text{otherwise.} \end{cases}$ (5)

Proof (sketch). In insertion, the k-distance of an existing point p_j changes when the new point enters the k-th nearest neighborhood of p_j, since in this case the k-neighborhood of p_j changes. If the new point p_c is the new k-th nearest neighbor of p_j, its distance from p_j becomes the new k-distance(p_j); otherwise, the old (k-1)-th nearest neighbor of p_j becomes the new k-th nearest neighbor of p_j (see Fig. 5).
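The k-distance maintenance of Theorem 1 is straightforward to express in code. The sketch below (our illustration; the neighbor list of one point p_j is kept as explicitly sorted parallel lists rather than in an index structure, and ties beyond the k-th neighbor are ignored for simplicity) applies Eq. (5) when a new point arrives at distance d_new from p_j:

import bisect

def update_k_neighborhood(nn_dists, nn_ids, k, d_new, id_new):
    """Maintain the sorted k-nearest-neighbor lists of one point p_j after
    insertion of a new point p_c at distance d_new, and return the new
    k-distance according to Eq. (5)."""
    if d_new >= nn_dists[k - 1]:
        return nn_dists[k - 1]             # p_c outside: k-distance unchanged
    pos = bisect.bisect_left(nn_dists, d_new)
    nn_dists.insert(pos, d_new)            # p_c enters the k-neighborhood
    nn_ids.insert(pos, id_new)
    del nn_dists[k:], nn_ids[k:]           # the old k-th neighbor drops out
    return nn_dists[k - 1]                 # d_new, or the old (k-1)-distance

With the neighbor list maintained this way, Corollary 1 below is immediate: the returned k-distance can only stay the same or decrease.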
Corollary 1. During insertion, the k-distance cannot increase, i.e., $k\text{-}\mathrm{distance}^{(new)}(p_j) \le k\text{-}\mathrm{distance}^{(old)}(p_j)$.

Theorem 2. A change of k-distance(p_j) may affect reach-dist_k(p_i, p_j) only for points p_i that are k-neighbors of p_j (see Fig. 3b).

Proof (sketch). By Eq. (1), for every point p_i with d(p_i, p_j) > k-distance^(old)(p_j) we have reach-dist_k^(old)(p_i, p_j) = d(p_i, p_j). According to Corollary 1, k-distance(p_j) cannot increase; hence, if d(p_i, p_j) > k-distance^(old)(p_j), then reach-dist_k^(new)(p_i, p_j) = reach-dist_k^(old)(p_i, p_j).

Theorem 3. The lrd value needs to be updated for every record (denoted p_m in Fig. 4) whose k-neighborhood changes or whose reachability distance to one of its k nearest neighbors changes. Hence, after each update of reach-dist_k(p_i, p_j) we have to update lrd(p_i) if p_j is among kNN(p_i). Also, lrd is updated for all points p_j whose k-distance was updated.

Proof (sketch). A change of the k-neighborhood of p_m affects the scope of the sum in Eq. (2), which is computed over all k-neighbors of p_m. A change of the reachability distance between p_m and one of its k nearest neighbors affects the corresponding term in the denominator of Eq. (2).

Theorem 4. The LOF value needs to be updated for all data records p_m whose lrd has been updated (since lrd(p_m) is the denominator in Eq. (3)), and for all records that have such records p_m among their k nearest neighbors. Hence, the set of data records where LOF needs to be updated corresponds to the union of the records p_m and their kRNN.

Proof (sketch). Similar to the proof of Theorem 3, using Eq. (3).

ii. Deletion. In data stream applications it is sometimes necessary to delete certain data records, e.g., due to their obsoleteness. Very often, not just a single data record but an entire data block, corresponding to particular outdated behavior, is deleted from the data set. As with insertion, upon deleting a block of data records S_delete there is a need to update the parameters of the incremental LOF algorithm. The general framework for deleting a block of data records S_delete from the dataset S is given in Fig. 6. The deletion of each record p_c in S_delete from the dataset S influences the k-distances of its kRNN: the k-neighborhood increases for each data record p_j that is in the reverse k-nearest neighborhood of p_c. For such records, k-distance(p_j) becomes equal to the distance from p_j to its new k-th nearest neighbor. The reachability distances from p_j's (k-1) nearest neighbors p_i to p_j need to be updated; observe that the reachability distance from the k-th neighbor of p_j to the record p_j is already equal to their Euclidean distance d(p_i, p_j) and does not need to be updated (Fig. 7). Analogously to insertion, the lrd value needs to be updated for all points p_j whose k-distance is updated. In addition, the lrd value needs to be updated for points p_i such that p_i is in kNN(p_j) and p_j is in kNN(p_i). Finally, the LOF value is updated for all points p_m whose lrd value is updated, as well as for their kRNN. The correctness of the deletion algorithm can be proven analogously to the correctness of the insertion algorithm.

Fig. 3. Illustration of the proposed incremental LOF algorithm (k = 2). a) Insertion of a new point n (red) results in computation of the reachability distances to its two nearest neighbors (cyan) and in an update of the 2-distance of the reverse nearest neighbors of n (yellow). b) The reachability distance reach-dist(q,p) is updated from each reverse 2-neighbor q of the point n to its 2-neighbors p (blue arrows from q to p). c) lrd is updated for all points whose 2-distance is updated and for points whose reachability distance to one of their 2-neighbors changes (green). d) LOF is updated for points whose lrd is updated and for points for which the lrd of one of their 2-neighbors is updated (pink).

IncrementalLOF_insertion(Dataset S)
Given: set S = {p_1, ..., p_N}, p_i ∈ R^D, where D is the dimensionality of the data records.
For each data point p_c in the data set S:
  insert(p_c);
  compute kNN(p_c);
  (∀p_j ∈ kNN(p_c)) compute reach-dist_k(p_c, p_j) using Eq. (1);
  // update the neighbors of p_c
  S_update_k_distance = kRNN(p_c);
  (∀p_j ∈ S_update_k_distance) update k-distance(p_j) using Eq. (5);
  S_update_lrd = S_update_k_distance;
  (∀p_j ∈ S_update_k_distance), (∀p_i ∈ kNN(p_j) \ {p_c}):
    reach-dist_k(p_i, p_j) = k-distance(p_j);
    if p_j ∈ kNN(p_i) then S_update_lrd = S_update_lrd ∪ {p_i};
  S_update_LOF = S_update_lrd;
  (∀p_m ∈ S_update_lrd):
    update lrd(p_m) using Eq. (2);
    S_update_LOF = S_update_LOF ∪ kRNN(p_m);
  (∀p_l ∈ S_update_LOF) update LOF(p_l) using Eq. (3);
  compute lrd(p_c) using Eq. (2);
  compute LOF(p_c) using Eq. (3);
End // for

Fig. 4. The general framework for insertion of a data record and computation of its LOF value in the incremental LOF algorithm.

Fig. 5. Update of the k-nearest-neighbor distance upon insertion of a new record (k = 3). a) The new record p_c is not among the 3 nearest neighbors of the record p_j, so 3-distance(p_j) does not change; b) The new record p_c is among the 3 nearest neighbors of p_j, so 3-distance(p_j) decreases. Cyan dashed lines denote the updates of the reachability distances between the point p_j and two old points.

B. Computational efficiency of the incremental LOF algorithm

To determine the time complexity of the proposed incremental LOF algorithm, it is essential to demonstrate that the number of affected data records (updates of k-distance, reachability distances, lrd and LOF values) does not depend on the current number n of records in the dataset, as stated by Theorems 5-8. Subsequently, Corollaries 3-5 provide the asymptotic time complexity of the proposed algorithm.

IncrementalLOF_deletion(Dataset S, S_delete)
S_update_k_distance = ∅;
(∀p_c ∈ S_delete):
  S_update_k_distance = S_update_k_distance ∪ kRNN(p_c);
  delete(p_c); // p_c can be deleted once all its reverse neighbors are found
S_update_k_distance = S_update_k_distance \ S_delete;
// points from S_delete may still be present while the reverse k-nearest neighbors are computed
(∀p_j ∈ S_update_k_distance) update k-distance(p_j);
S_update_lrd = S_update_k_distance;
(∀p_j ∈ S_update_k_distance), (∀p_i ∈ (k-1)-NN(p_j)):
  reach-dist_k(p_i, p_j) = k-distance(p_j);
  if p_j ∈ kNN(p_i) then S_update_lrd = S_update_lrd ∪ {p_i};
S_update_LOF = S_update_lrd;
(∀p_m ∈ S_update_lrd):
  update lrd(p_m) using Eq. (2);
  S_update_LOF = S_update_LOF ∪ kRNN(p_m);
(∀p_l ∈ S_update_LOF) update LOF(p_l) using Eq. (3);
return

Fig. 6. The framework for deletion of a block of data records in the incremental LOF method.

Fig. 7. Update of the k-nearest-neighbor distance upon deletion of the record p_c (k = 3). a) Prior to deletion, the data record p_c is among the 3 nearest neighbors of the record p_j; b) After the deletion of p_c, 3-distance(p_j) increases, and the reachability distances from the two nearest neighbors of p_j (denoted by cyan dashed lines) are updated to the new 3-distance(p_j).

Theorem 5. The maximal number F of k reverse nearest neighbors of a record p is proportional to k, exponentially proportional to the dimensionality D, and does not depend on n.

To prove Theorem 5, we first establish a few definitions and prove Lemmas 1 and 2, needed to describe the geometry of the D-dimensional space where the considered data records reside.

Definition 1. The cone C in D-dimensional space with vertex v and axis l is the locus of points x such that the Euclidean distance d(x,v) of the point x to the vertex is proportional to the distance d(x',v) of the point's projection x' onto l to the vertex v. A line containing a point x on the cone and the vertex is called a generatrix. When the vertex is at the origin of the coordinate system (v = O), we refer to the cone as a centered cone.

Definition 2. The half-axis angle of a cone is the angle α/2 = ∠xvx' between the axis and any generatrix.

Lemma 1. The coordinates X_1, X_2, ..., X_D of a point x on the centered cone C whose axis l is parallel to the x_1 axis of the coordinate system satisfy

$X_2^2 + X_3^2 + \cdots + X_D^2 = a X_1^2, \quad a > 0,$ (6)

where a is a pre-specified parameter. The half-axis angle of C satisfies

$\cos\frac{\alpha}{2} = \frac{X_1}{\sqrt{\sum_{i=1}^{D} X_i^2}} = \frac{1}{\sqrt{1+a}}.$ (7)

Proof. Follows directly from Definitions 1 and 2 and the fact that the length of the projection of x onto l is X_1 when l is parallel to x_1.

Definition 3. All points whose coordinates satisfy X_2^2 + X_3^2 + ... + X_D^2 < a X_1^2 are inside the cone C and comprise the set inside(C).

Definition 4. The ball B(c,R) in D-dimensional space is the locus of points x such that the distance of the point x from a prespecified point c is smaller than or equal to R > 0.

Lemma 2. Consider a cone C with vertex p and half-axis angle α/2 ≤ 30°, the ball B(p,R), and the volume V = B ∩ C. Then for every p' ∈ C with d(p,p') = R, it holds that V ⊆ B'(p',R). (See Fig. 8 for a three-dimensional illustration.)

Fig. 8. Illustration of Lemma 2 in three-dimensional space.

Proof (sketch). Without loss of generality, we may assume that the cone is centered, i.e., that p is at the origin. Let the coordinates of the point p' be X_1, X_2, ..., X_D, and consider the point p'' symmetric to p' with respect to the x_1 axis, with coordinates X_1, -X_2, ..., -X_D. Clearly, p'' ∈ C and d(p,p'') = R. The distance between these two points is d(p',p'') = 2·sqrt(X_2^2 + ... + X_D^2) = 2·sqrt(a)·X_1. Since the points p' and p'' lie on the sphere of radius R, from Eq. (7) we obtain d(p',p'') = 2R·sin(α/2) ≤ 2R·sin 30° = R. Therefore, the ball B'(p',R) contains the point p'' antipodal to p'. It can be shown that B' also contains any other boundary point of V.
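For completeness, the trigonometric step behind the bound in Lemma 2 can be spelled out explicitly; this is our reconstruction from Eqs. (6) and (7):

$X_1 = R\cos\frac{\alpha}{2}, \qquad \sqrt{X_2^2+\cdots+X_D^2} = \sqrt{a}\,X_1 = R\sin\frac{\alpha}{2},$

$d(p',p'') = 2\sqrt{X_2^2+\cdots+X_D^2} = 2R\sin\frac{\alpha}{2} \le 2R\sin 30^\circ = R.$

The 60-degree opening angle is thus exactly what makes the antipodal point p'' (and, by the same argument, every boundary point of V) fall within distance R of p'.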
Definition 5. A frame of cones C_i, i = 1, ..., h, is defined as a set of h cones with common vertex p and common angle α.

The frame of cones completely covers the D-dimensional Euclidean space E^D if $\bigcup_{i=1}^{h} C_i = E^D$ (see Fig. 9).

Lemma 3. The lower bound on the number of α = 60° cones in a D-dimensional frame is

$h_{\min} = \frac{\int_0^{\pi} \sin^{D-2}\varphi \, d\varphi}{\int_0^{\pi/6} \sin^{D-2}\varphi \, d\varphi} = \Theta(2^D).$

Proof (sketch). The lower bound on the number of cones in a frame equals the ratio of the area of the hypersphere to the area of the hyperspherical cap (the part of the hypersphere inside a cone with angle α). Note that h_min depends only on the geometry of D-dimensional space and is independent of the number and placement of the D-dimensional data records.

The following definition and corollary link the geometry of D-dimensional cones to the proximity notion in D-dimensional datasets.

Definition 6. Let S' be the set of the D-dimensional data records inside a cone C with vertex p. The k-nearest neighbor of the point p in the cone C is a record q from the dataset S' such that for at least k records o ∈ S' \ {p} it holds that d(p,o) ≤ d(p,q), and for at most k-1 records o ∈ S' \ {p} it holds that d(p,o) < d(p,q). We also define k-distance_C(p) as the distance from the record p to its k-th nearest neighbor q in the cone. The k-nearest neighborhood of p in the cone C (referred to as kNN_C(p)) includes all records r ∈ S' \ {p} such that d(p,r) ≤ d(p,q).

Corollary 2. Consider a cone C centered at a data point p with α = 60°, and let p' be a data point in the cone C. Then (∀p' ∈ inside(C)): p' ∉ kNN_C(p) ⟹ p ∉ kNN_C(p').

Proof (sketch). Consider the ball B(p, k-distance_C(p)). By Definition 6, the volume V = B ∩ C contains exactly k points other than p. Consider a data point p' ∈ C such that d(p,p') > k-distance_C(p), and the ball B'(p', d(p,p')). According to Lemma 2, this ball contains as a subset the whole volume V, which together with p comprises k+1 data points. Hence, k-distance_C(p') < d(p',p), and therefore p ∉ kNN_C(p').

Proof of Theorem 5. Due to Corollary 2 and Definitions 5 and 6, $kRNN(p) \subseteq \bigcup_{i=1}^{h} kNN_{C_i}(p)$, where C_i, i = 1, ..., h, is a frame of cones with vertex p. Hence, due to Lemma 3 and Definition 6, $|kRNN(p)| \le F = k \cdot h_{\min} = \Theta(k \cdot 2^D)$. Since neither h_min nor k depends on n, the number of reverse nearest neighbors does not depend on the current number of points in the dataset.

(Footnote: Suboptimal values of h can be obtained by techniques that construct spherical codes, followed by covering verification (to ensure that the space around p is completely covered), e.g., based on examination of convex-hull facets. Analogous to the problem of optimal spherical codes, the problem of finding the smallest possible h for arbitrary D is unresolved and is related, but not equivalent, to the sphere covering problem. Using the aforementioned suboptimal construction, upper bounds on h have been demonstrated for several low dimensions.)
Fig. 9. a) Two 60-degree cones; b) Spherical caps of a frame of cones that completely cover the space.

The following theorems provide upper bounds on the number of points where k-distance, lrd and LOF are updated.

Theorem 6. The number of data records where the k-distance needs to be updated is |S_update_k_distance| ≤ F for insertion, and |S_update_k_distance| ≤ F·|S_delete| for deletion of a block of size |S_delete|.

Proof (sketch). Upon insertion/deletion of a single data record s, the k-distance needs to be updated on all data records that have the inserted/deleted record in their neighborhood, i.e., on all reverse nearest neighbors of s. Theorem 5 bounds the number of reverse nearest neighbors by F.

Theorem 7. The number of data records where lrd is updated is |S_update_lrd| ≤ (1 + k)·|S_update_k_distance|.

Proof (sketch). lrd values are updated on the points from S_update_k_distance and may additionally be updated on their k nearest neighbors.

Theorem 8. The number of data records where LOF is updated is |S_update_LOF| ≤ (1 + F)·|S_update_lrd|.

Proof (sketch). The LOF value is updated on the data records from S_update_lrd and on their reverse nearest neighbors, which gives the stated bound.

The following corollaries provide the asymptotic time complexity of the proposed algorithm.

Corollary 3. The asymptotic time complexities of insertion and deletion in the incremental LOF are

$T_{incrLOF\_ins} = O(k F \cdot T_{kNN} + k F \cdot T_{kRNN} + F k + T_{insert}),$ (8)
$T_{incrLOF\_del} = O(|S_{delete}| \cdot (k F \cdot T_{kNN} + k F \cdot T_{kRNN} + F k + T_{delete})).$

Here, T_kNN and T_kRNN are the time complexities of the kNN and kRNN algorithms, respectively, while T_insert and T_delete correspond to the time needed for insertion and deletion of a data record into/from the database (including index updating).

Proof (sketch). Follows from the insertion and deletion algorithms of the incremental LOF (Figs. 4 and 6, respectively) and Theorems 6-8.

By maintaining a list kRNN(p) for each record p, the time complexities can be further reduced to $T_{incrLOF\_ins} = O(k F \cdot T_{kNN} + T_{kRNN} + F k + T_{insert})$ and $T_{incrLOF\_del} = O(|S_{delete}| \cdot (k F \cdot T_{kNN} + F k + T_{delete}))$.
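Corollary 4 below assumes logarithmic-time kNN and kRNN queries. As a rough illustration of the kind of index such an implementation rests on, the following sketch uses SciPy's cKDTree for kNN queries; the reverse-kNN routine here is a naive linear-time placeholder for the specialized RkNN structures cited in the paper:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
tree = cKDTree(X)                          # O(n log n) build, fast queries

def knn(q, k):
    """Indices and distances of the k nearest neighbors of a query point."""
    dist, idx = tree.query(q, k=k)
    return idx, dist

def krnn_naive(q, k):
    """Reverse k-NN by brute force: all points whose k-distance admits q.
    k+1 is queried because each point returns itself at distance 0."""
    kdist = tree.query(X, k=k + 1)[0][:, -1]
    return np.where(np.linalg.norm(X - q, axis=1) <= kdist)[0]

idx, dist = knn(np.zeros(3), k=5)
print(idx, krnn_naive(np.zeros(3), k=5))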

Corollary 4. When efficient algorithms for kNN and kRNN queries, as well as efficient indexing structures for inserting and deleting records, are used, so that T_kNN = T_kRNN = T_insert = T_delete = O(log n), the time complexities T_incrLOF_ins and T_incrLOF_del are logarithmic in the current size n of the database, e.g.,

$T_{incrLOF\_ins} = O(k F \log n + F k).$ (9)

Corollary 5. The time complexity of the incremental LOF algorithm, after all updates to a dataset of size N are applied, is O(N log N).

Proof. Follows directly from Corollary 4.

Note that, according to Theorem 5, the time complexity of the incremental LOF may increase exponentially with the dimensionality D. However, this is a well-known problem of the static LOF [9], as well as of other density-based algorithms, and is not a particular issue of the incremental LOF.

IV. EXPERIMENTAL RESULTS

Our experiments were performed on several synthetic and real-life data sets. In all experiments we assumed that information about the outliers in the data set is available, so that detection performance can be evaluated. In the following subsections we evaluate time complexity (subsection A) and outlier detection accuracy (subsections B and C) with respect to the ground-truth outlier information.

A. Time Complexity Analysis

The time complexity analysis was performed on synthetic data sets, since there we could fully control the total number of data records N as well as the number of dimensions D. The reported experimental results provide evidence about (i) the relation between the number of updates of LOF values and the total number of data points N; (ii) the dependence of the number of updates of LOF values on the LOF parameter k; and (iii) the dependence of the number of updates of LOF values on the dimension D. The synthetic data sets combined four different numbers of data records N with five different numbers of dimensions D. For each pair (D, N), we created data sets with N random records generated from a D-variate distribution. We experimented with both the uniform and the standard (zero mean, unit covariance matrix) Gaussian distribution. For each of the data sets generated for a pair (D, N), we varied the value of the parameter k of the algorithm and measured the number of updates of k-distance, reach-dist, lrd and LOF values in the incremental LOF algorithm upon insertion of a new data record. Here, for each pair (D, N), we report the average number of LOF updates over the data sets generated using the standard Gaussian distribution; the results obtained for the data sets generated using the uniform distribution are analogous and are not reported here due to lack of space. Fig. 10 shows how the number of updates of LOF values depends on the total number of data records N (x-axis) for different numbers of dimensions D (different lines in the graphs), where each graph corresponds to a distinct value of the parameter k.

Fig. 10. The dependence of the number of LOF updates on the database size N for different numbers of dimensions D and different values of the parameter k. The results are averaged over the data sets generated from the standardized Gaussian distribution.

Analyzing Fig. 10, it can be observed that the number of updates of LOF values stabilizes for sufficiently large N, which is in accordance with our theoretical analysis from Section III.B showing O(1) updates with respect to N.
It is interesting to note that for larger k, the number of data records necessary for the number of LOF updates to stabilize is generally larger. However, for typically used values of k, the number of LOF updates becomes constant once N is sufficiently large. Fig. 11 shows the average number of LOF updates vs. the parameter k (each curve corresponds to a particular value of D) on a database of fixed size N. The left graph uses a linear abscissa, while the abscissa of the right graph is quadratic (proportional to k^2). Fig. 11 indicates that the actual number of updates grows no faster than k^2. Therefore, the worst-case upper bound O(F·k) = O(k^2·2^D) obtained in Section III.B appears rather pessimistic. This phenomenon is due to the fact that, in practice, not every update of a reach-dist value results in an update of an lrd value. Also, a data record may belong to the reverse nearest neighbors of multiple records whose lrd has been updated; such a data record, although output by several kRNN queries, results in only one update of its LOF value. Fig. 11 also provides insight into the dependence of the number of LOF updates on the number of dimensions D. While the number of LOF updates undoubtedly increases with D, it was difficult to confirm (or reject) the theoretical upper bound of exponential dependence (see Section III.B). However, it is evident that the growth of the number of LOF updates with respect to the dimensionality D is not explosive: the average number of updates stays modest even for the largest D and k considered. One reason is that the stated upper bound on the number of reverse nearest neighbors is a worst case that is reached rather infrequently. Hence, we anticipate that data dimensionality will become a bottleneck of the incremental LOF algorithm not through the number of LOF updates, but rather through the inability of indexing structures to withstand the curse of dimensionality.
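The flavor of this measurement is easy to reproduce. The sketch below (our illustration, using brute-force neighbor computations) inserts standard-Gaussian points one at a time and counts how many existing points acquire the new point as a k-nearest neighbor, i.e., the size of the kRNN set bounded by Theorem 5:

import numpy as np

def affected_counts(D, N, k, seed=0):
    """Insert N standard-Gaussian points one by one; after each insertion
    count the existing points whose k-neighborhood the new point enters."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, D))
    counts = []
    for n in range(k + 1, N):
        old, new = X[:n], X[n]
        d_new = np.linalg.norm(old - new, axis=1)
        dists = np.linalg.norm(old[:, None, :] - old[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        kdist = np.sort(dists, axis=1)[:, k - 1]   # current k-distances
        counts.append(int(np.sum(d_new < kdist)))  # |kRNN| of the new point
    return counts

print(np.mean(affected_counts(D=2, N=300, k=5)))   # roughly flat in N

Consistent with the theory, the average count settles to a value that depends on k and D but not on the number of points already inserted.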

Fig. 11. The dependence of the number of LOF updates on the value of the parameter k for different numbers of dimensions D. The results are averaged over data sets with records generated from the standardized normal distribution. a) Linear abscissa; b) Abscissa proportional to k^2.

B. Learning New Behavior

The first synthetic data set, used to analyze the ability of the incremental LOF algorithm to learn new behavior in non-stationary data, contained records generated from a two-component mixture of multivariate Gaussian distributions with different means: records generated from a Gaussian distribution N(μ1, Σ) were followed by records generated from a Gaussian distribution N(μ2, Σ) with a shifted mean and the same covariance matrix. Fig. 12 shows the current LOF values at four successive snapshots during insertion. The first inserted record from the new distribution was identified as an outlier as soon as it appeared. However, once the switch to the new distribution was complete, the incremental LOF algorithm had learned the new distribution as part of the regular behavior, so the subsequent new records were correctly labeled as normal. As discussed in Section II, if the supervised static LOF (trained on the records of the first distribution) were used instead, the data records from the new distribution would always be marked as outliers, since they did not appear in the original data on which the supervised static LOF was trained (see Fig. 1). After all records were inserted, the resulting LOF values were identical to those obtained using the periodic static LOF algorithm; the periodic LOF algorithm, however, identifies all records belonging to the new distribution as normal, although some of them were outliers at the time of their insertion.

The second synthetic data set consisted of data records belonging to a two-component Gaussian mixture with the same mean but different variances. This data set was created to illustrate an attempt at masquerading (hiding within an existing distribution; see Fig. 2): records from a Gaussian distribution N(μ, Σ1) were followed by records from a Gaussian distribution N(μ, Σ2) with a much smaller covariance and hence much higher density. Fig. 13 shows the current LOF values obtained using the incremental LOF algorithm at four successive snapshots during insertion. Initially, the computed LOF values do not reveal any change of behavior. However, as additional data records continue to appear within the new distribution, their lrd values start to increase, reflecting the larger local density of the new distribution. According to Eq. (3), this causes an increase of the LOF values of points from the old distribution that lie on the border of the new distribution, and the maximal LOF value eventually grows markedly. Since the incremental LOF algorithm keeps track of updates of the LOF values of the existing data records (those already in the database), this phenomenon is easy to identify. Therefore, the incremental LOF algorithm is capable of identifying the masquerading attempt, as well as the approximate time when the attempt begins! As already discussed in Section II, the supervised static LOF (trained on the records of the first distribution) is not capable of identifying this behavior, since the new data records, corresponding to the new, much denser distribution, are not considered when computing the lrd values. In contrast, the periodic LOF method is able to identify the masquerading, but this identification is delayed by up to the length of the LOF update period.
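The structure of these two synthetic experiments is simple to emulate. The following sketch (our illustration; the means, scales and block sizes are placeholder values, not the paper's exact settings) builds both kinds of streams, a mean shift and a masquerading density change:

import numpy as np

def two_phase_stream(n1, n2, mu1, mu2, s1, s2, seed=0):
    """A stream with an abrupt distribution change: n1 points from
    N(mu1, s1^2 I) followed by n2 points from N(mu2, s2^2 I)."""
    rng = np.random.default_rng(seed)
    a = mu1 + s1 * rng.normal(size=(n1, len(mu1)))
    b = mu2 + s2 * rng.normal(size=(n2, len(mu1)))
    return np.vstack([a, b])

mu = np.zeros(2)
shift_stream = two_phase_stream(500, 500, mu, mu + 4.0, 1.0, 1.0)  # new mode
masking_stream = two_phase_stream(500, 500, mu, mu, 1.0, 0.05)     # denser mode

Feeding either stream point by point into the insertion procedure of Fig. 4 and inspecting the LOF values at insertion time reproduces the qualitative behavior described above.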
Fig. 12. The result of the incremental LOF algorithm on the data set consisting of records from a Normal distribution N(μ1, Σ) followed by records from a Normal distribution N(μ2, Σ) with a shifted mean, shown at four successive snapshots a)-d) during insertion. The data records inserted at the current time instant are marked red.

C. Experiments on real-life data sets

To illustrate the ability of the incremental LOF algorithm to identify outliers in a dynamic environment, we first selected two real-life data sets containing video sequences. The first data set is composed of video frames, with features extracted from the video frames using a previously described procedure. Our goal was to identify sudden changes in the selected video frames, which is an important problem in video analysis, especially in the analysis of streaming videos (e.g., video surveillance). Analyzing Fig. 14, it can be observed that the proposed incremental LOF correctly detected all the sudden changes in the video frames while not producing any false alarms. These changes were caused by the appearance of a new object, by objects zooming toward the camera, and by novel video content. On the other hand, the static periodic LOF algorithm computed after all frames did not detect any of these frames as outliers, while the supervised LOF algorithm had a very large false alarm rate due to the non-stationarity of the data.

Fig. 13. The result of the incremental LOF algorithm on the data set consisting of records from a Gaussian distribution N(μ, Σ1) followed by records from a Gaussian distribution N(μ, Σ2) with the same mean and a much smaller covariance, shown at four successive snapshots a)-d) during insertion.

Finding unusual trajectories in surveillance videos may lead to early identification of illegal behavior, and hence the outliers need to be detected as soon as they appear. In such applications, due to data non-stationarity, the static LOF algorithm is not suitable. To identify outliers in this environment, we performed experiments on a second real-life data set. The data set corresponds to video motion trajectories extracted from IR surveillance videos using our motion detection and tracking algorithms, among which only two trajectories (a person walking right and then back left, and a person walking very slowly) were visually identified as unusual behavior. Each trajectory was represented by five equidistant points in [x, y, time] space (two spatial coordinates on the frame and the time instant), and the dimensionality of this feature vector was further reduced to three using principal component analysis. The data set is publicly available. Results of outlier detection on the IR video trajectories are shown in Fig. 15. Fig. 15a shows the values LOF(t) computed using the incremental LOF algorithm for each trajectory t at the moment of its insertion into the database. As can be observed, the LOF values of the two unusual trajectories are significantly larger than those of the remaining trajectories, clearly indicating unusual behavior. To further illustrate the performance of the incremental LOF method, we plotted all considered trajectories in [x, y, time] space (Fig. 15b), color-coded by the computed LOF value. It can be seen that the most intense color (and the highest LOF) corresponds to the clear outliers.

Fig. 14. Result of applying the incremental (blue asterisks) and static periodic (red x) LOF algorithms on video sequences from our test movie. For the static LOF algorithm, the LOF values shown for each frame t are computed when the latest data record is inserted.
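The trajectory representation used above is also easy to reproduce. A sketch (our illustration; the tracker output format is an assumption) that turns five equidistant [x, y, time] samples per trajectory into three PCA features:

import numpy as np

def trajectory_features(trajectories, n_components=3):
    """trajectories: array of shape (n_traj, 5, 3) holding five equidistant
    [x, y, time] samples per trajectory. Flatten to 15-dimensional vectors
    and project onto the top principal components (PCA via SVD)."""
    X = trajectories.reshape(len(trajectories), -1)    # (n_traj, 15)
    Xc = X - X.mean(axis=0)                            # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                    # (n_traj, 3)

rng = np.random.default_rng(0)
print(trajectory_features(rng.normal(size=(40, 5, 3))).shape)   # (40, 3)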
The third data set was the 1998 DARPA Intrusion Detection Evaluation Data. The original DARPA 98 data contain two parts: training data and test data. The training data consist of seven weeks of network-based attacks inserted into normal background data; attacks in the training data are labeled. The test data contain two weeks of network-based attacks and normal background data. Seven weeks of data resulted in about five million connection records. Since the amount of available data is huge (some days have several million connection records), we chose only the data from the fifth week to present our findings. We used the tcptrace utility software as the packet filtering tool in order to extract information about packets from TCP connections and to construct new features. The DARPA 98 training data include list files that identify the time stamps (start time and duration), service type, source IP address and source port, destination IP address and destination port, as well as the type of each attack. We used this information to map the connection records from the list files to the connections obtained using the tcptrace utility software and to correctly label each connection record as normal or as a particular attack type. The same technique was used to construct the KDDCup 99 data set, but that data set did not keep time information about the attacks. The main reason for this procedure is to associate the newly constructed features with the connection records from the list files and to create a more informative data set for learning. The complete list of features extracted from the raw tcpdump data using the tcptrace software is available in our previously published results.

The results of applying the incremental LOF algorithm to the DARPA 98 data set are shown in Fig. 16. We have chosen to illustrate only two phenomena here. The first phenomenon corresponds to a time moment when new behavior starts to appear. This behavior was clearly detected by the incremental LOF algorithm (Fig. 16, dash-dot cyan line). Although the detected behavior did not correspond to any intrusion, it was important to include it when updating the characteristics of normal behavior.

The second phenomenon, at time moment t = 97, corresponds to the detection of a Denial of Service (DoS) attack (Fig. 16, red dashed line). The LOF value at this time moment spikes very high, resulting in immediate DoS detection. If the static periodic LOF algorithm were applied after all data records had been inserted into the data set, this DoS attack would not be detected (Fig. 16, black dash-dot line), since the attack records would create a new data distribution (the scenario of Fig. 1 discussed in Section II).

Fig. 15. The incremental LOF algorithm used for identifying unusual trajectories in motion videos: a) the LOF value obtained for each trajectory; b) the trajectories in [x, y, time] feature space, colored according to the LOF value.

V. CONCLUSIONS AND FUTURE WORK

A framework for an incremental density-based outlier detection scheme has been presented. The proposed algorithm has the same detection performance as the static iterated LOF algorithm (applied after insertion of each data record) but is much more computationally efficient. Experimental results on several synthetic and real-life data sets from the video and intrusion detection domains indicate that the proposed incremental LOF algorithm provides additional functionality that is difficult to achieve with static variants of the LOF algorithm, including detection of new behavior as well as identification of masquerading outliers.

Fig. 16. Instantaneous LOF values for all the data records in the dataset at two time instants, obtained using the incremental LOF algorithm on the network intrusion dataset.

The fact that the number of updates in the incremental LOF algorithm per insertion/deletion of a single data record does not depend on the total number of data records is quite appealing for its use in real-time data stream applications. However, its performance crucially depends on efficient indexing structures to support k-nearest neighbor and reverse k-nearest neighbor queries. Due to the limitations of existing indexing structures for high-dimensional data, the proposed incremental LOF (like the static LOF) is not applicable when the data have a very large number of dimensions. Approximate kNN and reverse-kNN algorithms might improve the applicability of incremental LOF to multidimensional data. Future work on deleting data records from the database is needed. More specifically, it would be interesting to design an algorithm with exponential decay of weights, where the most recent data records have the highest influence on the local density estimation. In addition, an extension of the proposed methodology to create incremental versions of other emerging outlier detection algorithms, e.g., the Connectivity-based Outlier Factor (COF) or LOCI, is also worth considering. Additional real-life data sets will be used to evaluate the proposed algorithm, and ROC curves will be applied to quantify algorithm performance.

ACKNOWLEDGMENTS

D. Pokrajac has been partially supported by NIH, DoD/DoA and NSF grants. The authors would also like to thank Roland Miezianko, Terravic Corp., for providing the IR surveillance video; Brian Tjaden, Wellesley College, for discussion about reverse k-NN queries; and David Mount, University of Maryland, for insightful discussion about sphere covering verification.

REFERENCES
[1] M. Joshi, R. Agarwal, V. Kumar, PNrule, Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction, In Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA, May 2001.
[2] N. Chawla, A. Lazarevic, L. Hall, K. Bowyer, SMOTEBoost: Improving the Prediction of the Minority Class in Boosting, In Proceedings of the


More information

Skill Sets Chapter 5 Functions

Skill Sets Chapter 5 Functions Skill Sets Chapter 5 Functions No. Skills Examples o questions involving the skills. Sketch the graph o the (Lecture Notes Example (b)) unction according to the g : x x x, domain. x, x - Students tend

More information

Chapter 3 Image Enhancement in the Spatial Domain

Chapter 3 Image Enhancement in the Spatial Domain Chapter 3 Image Enhancement in the Spatial Domain Yinghua He School o Computer Science and Technology Tianjin University Image enhancement approaches Spatial domain image plane itsel Spatial domain methods

More information

Scheduling in Multihop Wireless Networks without Back-pressure

Scheduling in Multihop Wireless Networks without Back-pressure Forty-Eighth Annual Allerton Conerence Allerton House, UIUC, Illinois, USA September 29 - October 1, 2010 Scheduling in Multihop Wireless Networks without Back-pressure Shihuan Liu, Eylem Ekici, and Lei

More information

Estimating the Free Region of a Sensor Node

Estimating the Free Region of a Sensor Node Estimating the Free Region of a Sensor Node Laxmi Gewali, Navin Rongratana, Jan B. Pedersen School of Computer Science, University of Nevada 4505 Maryland Parkway Las Vegas, NV, 89154, USA Abstract We

More information

Mobile Robot Static Path Planning Based on Genetic Simulated Annealing Algorithm

Mobile Robot Static Path Planning Based on Genetic Simulated Annealing Algorithm Mobile Robot Static Path Planning Based on Genetic Simulated Annealing Algorithm Wang Yan-ping 1, Wubing 2 1. School o Electric and Electronic Engineering, Shandong University o Technology, Zibo 255049,

More information

Small Moving Targets Detection Using Outlier Detection Algorithms

Small Moving Targets Detection Using Outlier Detection Algorithms Please verify that (1 all pages are present, (2 all figures are acceptable, (3 all fonts and special characters are correct, and (4 all text and figures fit within the Small Moving Targets Detection Using

More information

Anomaly Detection on Data Streams with High Dimensional Data Environment

Anomaly Detection on Data Streams with High Dimensional Data Environment Anomaly Detection on Data Streams with High Dimensional Data Environment Mr. D. Gokul Prasath 1, Dr. R. Sivaraj, M.E, Ph.D., 2 Department of CSE, Velalar College of Engineering & Technology, Erode 1 Assistant

More information

Computer Data Analysis and Use of Excel

Computer Data Analysis and Use of Excel Computer Data Analysis and Use o Excel I. Theory In this lab we will use Microsot EXCEL to do our calculations and error analysis. This program was written primarily or use by the business community, so

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Piecewise polynomial interpolation

Piecewise polynomial interpolation Chapter 2 Piecewise polynomial interpolation In ection.6., and in Lab, we learned that it is not a good idea to interpolate unctions by a highorder polynomials at equally spaced points. However, it transpires

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Automated Modelization of Dynamic Systems

Automated Modelization of Dynamic Systems Automated Modelization o Dynamic Systems Ivan Perl, Aleksandr Penskoi ITMO University Saint-Petersburg, Russia ivan.perl, aleksandr.penskoi@corp.imo.ru Abstract Nowadays, dierent kinds o modelling settled

More information

Compressed Sensing Image Reconstruction Based on Discrete Shearlet Transform

Compressed Sensing Image Reconstruction Based on Discrete Shearlet Transform Sensors & Transducers 04 by IFSA Publishing, S. L. http://www.sensorsportal.com Compressed Sensing Image Reconstruction Based on Discrete Shearlet Transorm Shanshan Peng School o Inormation Science and

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

Joint Congestion Control and Scheduling in Wireless Networks with Network Coding

Joint Congestion Control and Scheduling in Wireless Networks with Network Coding Title Joint Congestion Control and Scheduling in Wireless Networks with Network Coding Authors Hou, R; Wong Lui, KS; Li, J Citation IEEE Transactions on Vehicular Technology, 2014, v. 63 n. 7, p. 3304-3317

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Classification Method for Colored Natural Textures Using Gabor Filtering

Classification Method for Colored Natural Textures Using Gabor Filtering Classiication Method or Colored Natural Textures Using Gabor Filtering Leena Lepistö 1, Iivari Kunttu 1, Jorma Autio 2, and Ari Visa 1, 1 Tampere University o Technology Institute o Signal Processing P.

More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Monocular 3D human pose estimation with a semi-supervised graph-based method

Monocular 3D human pose estimation with a semi-supervised graph-based method Monocular 3D human pose estimation with a semi-supervised graph-based method Mahdieh Abbasi Computer Vision and Systems Laboratory Department o Electrical Engineering and Computer Engineering Université

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017 CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

More information

Global Constraints. Combinatorial Problem Solving (CPS) Enric Rodríguez-Carbonell (based on materials by Javier Larrosa) February 22, 2019

Global Constraints. Combinatorial Problem Solving (CPS) Enric Rodríguez-Carbonell (based on materials by Javier Larrosa) February 22, 2019 Global Constraints Combinatorial Problem Solving (CPS) Enric Rodríguez-Carbonell (based on materials by Javier Larrosa) February 22, 2019 Global Constraints Global constraints are classes o constraints

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Neighbourhood Operations

Neighbourhood Operations Neighbourhood Operations Neighbourhood operations simply operate on a larger neighbourhood o piels than point operations Origin Neighbourhoods are mostly a rectangle around a central piel Any size rectangle

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Querying Complex Spatio-Temporal Sequences in Human Motion Databases

Querying Complex Spatio-Temporal Sequences in Human Motion Databases Querying Complex Spatio-Temporal Sequences in Human Motion Databases Yueguo Chen #, Shouxu Jiang, Beng Chin Ooi #, Anthony KH Tung # # School o Computing, National University o Singapore Computing, Law

More information

Lagrangian relaxations for multiple network alignment

Lagrangian relaxations for multiple network alignment Noname manuscript No. (will be inserted by the editor) Lagrangian relaxations or multiple network alignment Eric Malmi Sanjay Chawla Aristides Gionis Received: date / Accepted: date Abstract We propose

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Convex hulls of spheres and convex hulls of convex polytopes lying on parallel hyperplanes

Convex hulls of spheres and convex hulls of convex polytopes lying on parallel hyperplanes Convex hulls of spheres and convex hulls of convex polytopes lying on parallel hyperplanes Menelaos I. Karavelas joint work with Eleni Tzanaki University of Crete & FO.R.T.H. OrbiCG/ Workshop on Computational

More information

Some Inequalities Involving Fuzzy Complex Numbers

Some Inequalities Involving Fuzzy Complex Numbers Theory Applications o Mathematics & Computer Science 4 1 014 106 113 Some Inequalities Involving Fuzzy Complex Numbers Sanjib Kumar Datta a,, Tanmay Biswas b, Samten Tamang a a Department o Mathematics,

More information

Chapter 9: Outlier Analysis

Chapter 9: Outlier Analysis Chapter 9: Outlier Analysis Jilles Vreeken 8 Dec 2015 IRDM Chapter 9, overview 1. Basics & Motivation 2. Extreme Value Analysis 3. Probabilistic Methods 4. Cluster-based Methods 5. Distance-based Methods

More information

Classifier Evasion: Models and Open Problems

Classifier Evasion: Models and Open Problems . In Privacy and Security Issues in Data Mining and Machine Learning, eds. C. Dimitrakakis, et al. Springer, July 2011, pp. 92-98. Classiier Evasion: Models and Open Problems Blaine Nelson 1, Benjamin

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

Using a Projected Subgradient Method to Solve a Constrained Optimization Problem for Separating an Arbitrary Set of Points into Uniform Segments

Using a Projected Subgradient Method to Solve a Constrained Optimization Problem for Separating an Arbitrary Set of Points into Uniform Segments Using a Projected Subgradient Method to Solve a Constrained Optimization Problem or Separating an Arbitrary Set o Points into Uniorm Segments Michael Johnson May 31, 2011 1 Background Inormation The Airborne

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2016 A second course in data mining!! http://www.it.uu.se/edu/course/homepage/infoutv2/vt16 Kjell Orsborn! Uppsala Database Laboratory! Department of Information Technology,

More information

Identifying Representative Traffic Management Initiatives

Identifying Representative Traffic Management Initiatives Identiying Representative Traic Management Initiatives Alexander Estes Applied Mathematics & Statistics, and Scientiic Computation University o Maryland-College Park College Park, MD, United States o America

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Learning Implied Global Constraints

Learning Implied Global Constraints Learning Implied Global Constraints Christian Bessiere LIRMM-CNRS U. Montpellier, France bessiere@lirmm.r Remi Coletta LRD Montpellier, France coletta@l-rd.r Thierry Petit LINA-CNRS École des Mines de

More information

Reducing the Bandwidth of a Sparse Matrix with Tabu Search

Reducing the Bandwidth of a Sparse Matrix with Tabu Search Reducing the Bandwidth o a Sparse Matrix with Tabu Search Raael Martí a, Manuel Laguna b, Fred Glover b and Vicente Campos a a b Dpto. de Estadística e Investigación Operativa, Facultad de Matemáticas,

More information

Pebble Sets in Convex Polygons

Pebble Sets in Convex Polygons 2 1 Pebble Sets in Convex Polygons Kevin Iga, Randall Maddox June 15, 2005 Abstract Lukács and András posed the problem of showing the existence of a set of n 2 points in the interior of a convex n-gon

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

Reflection and Refraction

Reflection and Refraction Relection and Reraction Object To determine ocal lengths o lenses and mirrors and to determine the index o reraction o glass. Apparatus Lenses, optical bench, mirrors, light source, screen, plastic or

More information

A Cylindrical Surface Model to Rectify the Bound Document Image

A Cylindrical Surface Model to Rectify the Bound Document Image A Cylindrical Surace Model to Rectiy the Bound Document Image Huaigu Cao, Xiaoqing Ding, Changsong Liu Department o Electronic Engineering, Tsinghua University State Key Laboratory o Intelligent Technology

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Generell Topologi. Richard Williamson. May 6, 2013

Generell Topologi. Richard Williamson. May 6, 2013 Generell Topologi Richard Williamson May 6, Thursday 7th January. Basis o a topological space generating a topology with a speciied basis standard topology on R examples Deinition.. Let (, O) be a topological

More information

Scalable Test Problems for Evolutionary Multi-Objective Optimization

Scalable Test Problems for Evolutionary Multi-Objective Optimization Scalable Test Problems or Evolutionary Multi-Objective Optimization Kalyanmoy Deb Kanpur Genetic Algorithms Laboratory Indian Institute o Technology Kanpur PIN 8 6, India deb@iitk.ac.in Lothar Thiele,

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1 Asymptotics, Recurrence and Basic Algorithms 1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1 1. O(logn) 2. O(n) 3. O(nlogn) 4. O(n 2 ) 5. O(2 n ) 2. [1 pt] What is the solution

More information

Research on Image Splicing Based on Weighted POISSON Fusion

Research on Image Splicing Based on Weighted POISSON Fusion Research on Image Splicing Based on Weighted POISSO Fusion Dan Li, Ling Yuan*, Song Hu, Zeqi Wang School o Computer Science & Technology HuaZhong University o Science & Technology Wuhan, 430074, China

More information

Automatic Video Segmentation for Czech TV Broadcast Transcription

Automatic Video Segmentation for Czech TV Broadcast Transcription Automatic Video Segmentation or Czech TV Broadcast Transcription Jose Chaloupka Laboratory o Computer Speech Processing, Institute o Inormation Technology and Electronics Technical University o Liberec

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Digital Image Processing. Image Enhancement in the Spatial Domain (Chapter 4)

Digital Image Processing. Image Enhancement in the Spatial Domain (Chapter 4) Digital Image Processing Image Enhancement in the Spatial Domain (Chapter 4) Objective The principal objective o enhancement is to process an images so that the result is more suitable than the original

More information

DETECTION OF CHANGES IN SURVEILLANCE VIDEOS. Longin Jan Latecki, Xiangdong Wen, and Nilesh Ghubade

DETECTION OF CHANGES IN SURVEILLANCE VIDEOS. Longin Jan Latecki, Xiangdong Wen, and Nilesh Ghubade DETECTION OF CHANGES IN SURVEILLANCE VIDEOS Longin Jan Latecki, Xiangdong Wen, and Nilesh Ghubade CIS Dept. Dept. of Mathematics CIS Dept. Temple University Temple University Temple University Philadelphia,

More information

ROBUST FACE DETECTION UNDER CHALLENGES OF ROTATION, POSE AND OCCLUSION

ROBUST FACE DETECTION UNDER CHALLENGES OF ROTATION, POSE AND OCCLUSION ROBUST FACE DETECTION UNDER CHALLENGES OF ROTATION, POSE AND OCCLUSION Phuong-Trinh Pham-Ngoc, Quang-Linh Huynh Department o Biomedical Engineering, Faculty o Applied Science, Hochiminh University o Technology,

More information

Semi-Supervised SVMs for Classification with Unknown Class Proportions and a Small Labeled Dataset

Semi-Supervised SVMs for Classification with Unknown Class Proportions and a Small Labeled Dataset Semi-Supervised SVMs or Classiication with Unknown Class Proportions and a Small Labeled Dataset S Sathiya Keerthi Yahoo! Labs Santa Clara, Caliornia, U.S.A selvarak@yahoo-inc.com Bigyan Bhar Indian Institute

More information

Motion based 3D Target Tracking with Interacting Multiple Linear Dynamic Models

Motion based 3D Target Tracking with Interacting Multiple Linear Dynamic Models Motion based 3D Target Tracking with Interacting Multiple Linear Dynamic Models Zhen Jia and Arjuna Balasuriya School o EEE, Nanyang Technological University, Singapore jiazhen@pmail.ntu.edu.sg, earjuna@ntu.edu.sg

More information

Dynamic Collision Detection

Dynamic Collision Detection Distance Computation Between Non-Convex Polyhedra June 17, 2002 Applications Dynamic Collision Detection Applications Dynamic Collision Detection Evaluating Safety Tolerances Applications Dynamic Collision

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Road Sign Analysis Using Multisensory Data

Road Sign Analysis Using Multisensory Data Road Sign Analysis Using Multisensory Data R.J. López-Sastre, S. Lauente-Arroyo, P. Gil-Jiménez, P. Siegmann, and S. Maldonado-Bascón University o Alcalá, Department o Signal Theory and Communications

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Non-exhaustive, Overlapping k-means

Non-exhaustive, Overlapping k-means Non-exhaustive, Overlapping k-means J. J. Whang, I. S. Dhilon, and D. F. Gleich Teresa Lebair University of Maryland, Baltimore County October 29th, 2015 Teresa Lebair UMBC 1/38 Outline Introduction NEO-K-Means

More information

2. Recommended Design Flow

2. Recommended Design Flow 2. Recommended Design Flow This chapter describes the Altera-recommended design low or successully implementing external memory interaces in Altera devices. Altera recommends that you create an example

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models

Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models Sven Koenig and Reid G. Simmons Carnegie Mellon University School o Computer Science Pittsburgh, PA

More information

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018 CPSC 340: Machine Learning and Data Mining Outlier Detection Fall 2018 Admin Assignment 2 is due Friday. Assignment 1 grades available? Midterm rooms are now booked. October 18 th at 6:30pm (BUCH A102

More information

A Novel Accurate Genetic Algorithm for Multivariable Systems

A Novel Accurate Genetic Algorithm for Multivariable Systems World Applied Sciences Journal 5 (): 137-14, 008 ISSN 1818-495 IDOSI Publications, 008 A Novel Accurate Genetic Algorithm or Multivariable Systems Abdorreza Alavi Gharahbagh and Vahid Abolghasemi Department

More information

Connected Components of Underlying Graphs of Halving Lines

Connected Components of Underlying Graphs of Halving Lines arxiv:1304.5658v1 [math.co] 20 Apr 2013 Connected Components of Underlying Graphs of Halving Lines Tanya Khovanova MIT November 5, 2018 Abstract Dai Yang MIT In this paper we discuss the connected components

More information