Incremental Local Outlier Detection for Data Streams


Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007)

Incremental Local Outlier Detection for Data Streams

Dragoljub Pokrajac, CIS Dept. and AMRC, Delaware State University, Dover, DE; Aleksandar Lazarevic, United Technologies Research Center, East Hartford, CT, USA; Longin Jan Latecki, CIS Department, Temple University, Philadelphia, PA

Abstract. Outlier detection has recently become an important problem in many industrial and financial applications. The problem is further complicated by the fact that in many cases outliers have to be detected from data streams that arrive at an enormous pace. In this paper, an incremental LOF (Local Outlier Factor) algorithm, appropriate for detecting outliers in data streams, is proposed. The proposed incremental LOF algorithm provides detection performance equivalent to the iterated static LOF algorithm (applied after insertion of each data record), while requiring significantly less computational time. In addition, the incremental LOF algorithm dynamically updates the profiles of data points. This is a very important property, since data profiles may change over time. The paper provides theoretical evidence that insertion of a new data point, as well as deletion of an old data point, influences only a limited number of their closest neighbors, so that the number of updates per insertion/deletion does not depend on the total number of points N in the data set. Our experiments performed on several simulated and real-life data sets demonstrate that the proposed incremental LOF algorithm is computationally efficient, while at the same time very successful in detecting outliers and changes of distributional behavior in various data stream applications.

I. INTRODUCTION

Despite the enormous amount of data being collected in many scientific and commercial applications, particular events of interest are still quite rare. These rare events, very often called outliers or anomalies, are defined as events that occur very infrequently (their frequency ranges from a few percent of the data down to far less than one percent, depending on the application). Detection of outliers (rare events) has recently gained a lot of attention in many domains, ranging from video surveillance and intrusion detection to fraudulent transactions and direct marketing. For example, in video surveillance applications, video trajectories that represent suspicious and/or unlawful activities (e.g., identification of traffic violators on the road, detection of suspicious activities in the vicinity of objects) represent only a small portion of all video trajectories. Similarly, in the network intrusion detection domain, the number of cyber attacks on the network is typically a very small fraction of the total network traffic. Although outliers (rare events) are by definition infrequent, in each of these examples their importance is quite high compared to other events, making their detection extremely important. Data mining techniques developed for this problem are based on both supervised and unsupervised learning. Supervised learning methods typically build a prediction model for rare events based on labeled data (the training set) and use it to classify each event [1, 2]. The major drawbacks of supervised data mining techniques include: (1) the necessity of having labeled data, which can be extremely time consuming to obtain for real-life applications, and (2) the inability to detect new types of rare events.
In contrast, unsupervised learning methods typically do not require labeled data and detect outliers as data points that are very different from the normal (majority) data according to some measure. These methods are typically called outlier/anomaly detection techniques, and their success depends on the choice of similarity measures, feature selection and weighting, etc. They have the advantage of detecting new types of rare events as deviations from normal behavior, but on the other hand they suffer from a possibly high rate of false positives, primarily because previously unseen (yet normal) data can also be recognized as outliers/anomalies. Very often, data in rare-event applications (e.g., network traffic monitoring, video surveillance, web usage logs) arrive continuously at an enormous pace, posing a significant challenge for timely analysis. In such cases, it is important to make decisions quickly and accurately. If there is a sudden or unexpected change in the existing behavior, it is essential to detect this change as soon as possible. Assume, for example, that a computer in the local area network uses only a limited number of services (e.g., Web traffic, telnet, ftp) through corresponding ports. All these services correspond to certain types of behavior in the network traffic data. If the computer suddenly starts to utilize a new service (e.g., ssh), this will certainly look like a new type of behavior in the network traffic data. Hence, it is desirable to detect such behavior as soon as it appears, especially since it may very often correspond to illegal or intrusive events. Even when this specific change in behavior is not necessarily intrusive or suspicious, it is very important for a security analyst to understand the network traffic and to update the notion of normal behavior. Further, on-line detection of unusual behavior and events also plays a significant role in video and image analysis. Automated identification of suspicious behavior and objects (e.g., people crossing the perimeter around protected areas, leaving unattended luggage at airport installations, cars driving unusually slow or unusually fast or with unusual trajectories) based on information extracted from video streams is currently an active research area. Other potential applications include traffic control and surveillance of commercial and residential buildings. These tasks are characterized by the need for real-time processing (such that any suspicious activity can be identified prior to doing harm to people, facilities and installations) and by dynamic, non-stationary and often noisy environments. Hence, there is a necessity for incremental outlier detection that can adapt to novel behavior and provide timely identification of unusual events.

Recently, the LOF (Local Outlier Factor) algorithm [9] has been successfully applied in many domains for outlier detection in a batch mode. In this paper, we propose a novel incremental LOF algorithm that is appropriate for detecting outliers in data streams. To the best of our knowledge, the proposed incremental LOF algorithm is the first incremental outlier detection algorithm. It provides detection performance equivalent to the static LOF algorithm and has O(N log N) time complexity, where N is the total number of data points. The paper shows that insertion of new data points, as well as deletion of obsolete points, influences only a limited number of their nearest neighbors, so that the insertion/deletion time complexity per data point does not depend on the total number of points N. Our experiments performed on several simulated and real-life data sets demonstrate that the proposed incremental LOF algorithm can be very successful in detecting outliers in various data streaming applications.

II. BACKGROUND

Outlier detection techniques can be categorized into four groups: (1) statistical approaches; (2) distance-based methods; (3) profiling methods; and (4) model-based approaches. In statistical techniques, the data points are typically modeled using a stochastic distribution, and points are labeled as outliers depending on their relationship with this model. Distance-based approaches detect outliers by computing distances among points. Several recently proposed distance-based outlier detection algorithms are based on (1) computing the full-dimensional distances among points using all the available features, or only feature projections, and (2) computing the densities of local neighborhoods [9]. In addition, clustering-based techniques have also been used to detect outliers, either as side products of the clustering algorithms (points that do not belong to clusters) or as clusters that are significantly smaller than others. In profiling methods, profiles of normal behavior are built using different data mining techniques or heuristic-based approaches, and deviations from them are considered outliers (e.g., network intrusions). Finally, model-based approaches usually first characterize the normal behavior using some predictive model (e.g., replicator neural networks or unsupervised support vector machines) and then detect outliers as deviations from the learned model.

Initially proposed outlier detection algorithms determine outliers once all the data records (samples) are present in the dataset. We refer to these algorithms as static outlier detection algorithms. In contrast, incremental outlier detection techniques identify outliers as soon as a new data record appears in the dataset. Incremental outlier detection has also been used within the more general framework of activity monitoring. In addition, Domingos and Hulten proposed broad requirements that incremental algorithms need to meet, while Yamanishi and Takeuchi used on-line discounting distributional learning of a Gaussian mixture model and scoring based on the estimated probability density function. In this study, we propose an incremental outlier detection algorithm based on computing the densities of local neighborhoods. In our previous work, we experimented with numerous outlier detection algorithms for network intrusion identification, and we concluded that local density based outlier detection (e.g., LOF) typically achieved the best prediction performance.
The main idea of the LOF algorithm [9] is to assign to each data record a degree of being an outlier. This degree is called the local outlier factor (LOF) of a data record. Data records (points) with high LOF have local densities smaller than their neighborhood and typically represent stronger outliers, unlike data points belonging to uniform clusters, which usually tend to have lower LOF values. The algorithm for computing the LOFs of all data records has the following steps:

1. For each data record q, compute k-distance(q) as the distance to the k-th nearest neighbor of q (for definitions, see Section III).

2. Compute the reachability distance of each data record q with respect to data record p as

$\mathrm{reach\text{-}dist}_k(q,p) = \max\big(d(q,p),\ k\text{-}\mathrm{distance}(p)\big),$ (1)

where d(q,p) is the Euclidean distance from q to p.

3. Compute the local reachability density (lrd) of each data record q as the inverse of the average reachability distance over the k nearest neighbors of q (in the original LOF publication [9], the parameter k was named MinPts):

$\mathrm{lrd}(q) = \left( \frac{1}{k} \sum_{p \in kNN(q)} \mathrm{reach\text{-}dist}_k(q,p) \right)^{-1}.$ (2)

4. Compute the LOF of each data record q as the ratio of the average local reachability density of q's k nearest neighbors and the local reachability density of q:

$\mathrm{LOF}(q) = \frac{\frac{1}{k}\sum_{p \in kNN(q)} \mathrm{lrd}(p)}{\mathrm{lrd}(q)}.$ (3)

The main advantages of the LOF approach over other outlier detection techniques include:
- It detects outliers with respect to the density of their neighboring data records, not with respect to a global model.
- It is able to detect outliers regardless of the data distribution of normal behavior, since it does not make any assumptions about the distribution of data records.
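As a concrete illustration of Eqs. (1)-(3), the following minimal Python sketch computes static LOF values by brute force. It is our reconstruction for illustration only, not the authors' implementation, and the helper names (knn_info, lof) are ours:

import numpy as np

def knn_info(X, k):
    """Brute-force neighbor search: pairwise distances, the indices of the
    k nearest neighbors of each point, and each point's k-distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)            # a point is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]      # k nearest neighbors per point
    k_dist = D[np.arange(len(X))[:, None], nn][:, -1]
    return D, nn, k_dist

def lof(X, k):
    D, nn, k_dist = knn_info(X, k)
    n = len(X)
    lrd = np.empty(n)
    for q in range(n):
        # Eq. (1): reach-dist_k(q, p) = max(d(q, p), k-distance(p))
        reach = np.maximum(D[q, nn[q]], k_dist[nn[q]])
        lrd[q] = 1.0 / reach.mean()        # Eq. (2)
    # Eq. (3): mean lrd of q's neighbors divided by lrd(q)
    return np.array([lrd[nn[q]].mean() / lrd[q] for q in range(n)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])   # one clear outlier
print(lof(X, k=5)[-1])                     # LOF of the outlier, well above 1

Points deep inside a uniform cluster obtain LOF values close to 1, while the isolated point receives a much larger value, matching the interpretation given above.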

In order to fully justify the need for incremental outlier detection techniques, it is important to understand that applying static LOF outlier detection algorithms to data streams would be extremely computationally inefficient and may often lead to incorrect prediction results. Namely, the static LOF algorithm may be applied to data streams in three ways:

1. Periodic LOF. Apply the LOF algorithm on the entire data set periodically (e.g., after every block of data records is inserted, similar to the strategy discussed in [9]), or after all the data records are inserted. The major problem of this approach is its inability to detect outliers that mark the beginning of a new behavior and initially appear within the inserted block. Fig. 1 illustrates this scenario. Assume that a new data point d_n (red asterisk in Fig. 1a) is added to the original data distribution (blue dots in Fig. 1a). Initially, the point d_n is an outlier, since it is distinct from all other data records. However, when additional data records (red asterisks in Fig. 1b) start to group around the initial data record d_n, these new points are no longer outliers, since they form their own cluster. Ideally, at the time of its insertion, the data record d_n should be identified as an outlier. In this periodic scenario, however, the LOF algorithm is applied only after all data records have been added to the original data set and have thus already formed the new distribution. Hence, all the points from the new cluster, including d_n, will be identified as normal behavior! On the other hand, if an incremental LOF algorithm is applied after every new data instance is inserted into the data set, it is possible not only to detect points like d_n as outliers but also to detect the moment in time when this change of behavior occurred.

Fig. 1. Inability of static LOF algorithms to identify a change of behavior: a) The point d_n can be correctly identified as an outlier by the supervised LOF algorithm but not by the periodic LOF; b) Points belonging to the new distribution are incorrectly identified as outliers by the supervised LOF.

2. Supervised LOF. Given a training data set D at time interval t0, the LOF algorithm is first applied to compute the k-distance, lrd and LOF values of all data records in the training data set D. For every time interval t > t0 at which a new data record d_n is inserted into the data set, the k-distance, the reachability distances and the lrd value are computed for the new record d_n. However, when computing LOF(d_n) using Eq. (3), lrd(d_n) is used along with the pre-computed lrd values of the original data set D. It is apparent that this approach results in several problems: (i) the estimated value LOF(d_n) is not accurate, since it uses pre-computed k-distance, reach-dist and lrd values; (ii) a new behavior (shown as red asterisks in Fig. 1b) is always detected as an outlier, since this approach does not update the normal behavior profile; (iii) masquerading (an attempt to hide within an existing distribution) cannot be identified, since all inserted data points are always considered normal as long as they belong to the normal distribution (Fig. 2). Namely, assume that a new data point d_n (red square in Fig. 2a) is inserted within the existing data distribution and further new data points start to group around the point d_n (red squares in Fig. 2b), but with much higher density than the original data distribution. These newly added points form a cluster of very high density which is substantially different from the original distribution. The supervised LOF approach considers these points to belong to the original data distribution, since it is not aware of the new data points forming the dense cluster. On the other hand, the incremental LOF algorithm, applied after the insertion of each new data point, identifies this phenomenon, since it takes the newly added points into account when updating the lrd and LOF values of the existing points (those already in the database).

Fig. 2. Detecting masqueraders (hiding within an existing distribution).

3. Iterated LOF. Re-apply the static LOF algorithm every time a new data record d_n is inserted into the data set. This approach does not suffer from the aforementioned problems, but it is extremely computationally expensive, since every time a new point is inserted the algorithm recomputes the LOF values of all the data points in the data set.
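Both of the above strategies can be written as short driver loops. As a rough illustration (ours, not the paper's code), the iterated strategy amounts to rerunning a static LOF implementation, such as the lof function sketched earlier, after every single insertion:

import numpy as np

def iterated_lof_scores(stream, k):
    """Iterated LOF baseline: rerun static LOF after every insertion and
    record the LOF value of the newly inserted point at insertion time."""
    data, scores = [], []
    for x in stream:
        data.append(x)
        if len(data) > k:                  # each point needs k neighbors
            scores.append(lof(np.array(data), k)[-1])
    return scores

Every pass recomputes values for points whose neighborhoods did not change at all; the incremental algorithm described in Section III instead touches only the affected neighbors.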
Knowing that the time complexity of the LOF algorithm is O(n log n) [9], where n is the current number of data records in the data set, the total time complexity of this iterated approach after insertion of N points is

$O\left(\sum_{n=1}^{N} n \log n\right) = O(N^2 \log N).$ (4)

Our proposed incremental LOF algorithm is designed to provably provide the same prediction performance (detection rate and false alarm rate) as the iterated LOF. This is achieved by consistently maintaining, for all existing points in the database, the same LOF values as the iterated LOF algorithm. The proposed incremental LOF algorithm efficiently addresses the problems illustrated in Figs. 1 and 2, and it has time complexity O(N log N), thus clearly outperforming the static iterated LOF approach. After all N data records are inserted into the data set, the final result of the incremental LOF algorithm on the N data points is independent of the order of insertion and is equivalent to that of the periodic LOF executed after all the data records are inserted.

III. METHODOLOGY

When designing the incremental LOF algorithm, we were motivated by two goals. First, the result of the incremental algorithm must be equivalent to the result of the static algorithm every time t a new point is inserted into the data set; thus, there is no difference between applying the incremental LOF and the periodic static LOF when all data records up to the time instant t are available. Second, the asymptotic time complexity of the incremental LOF algorithm has to be comparable to that of the static LOF algorithm. In order to have a feasible incremental algorithm, it is essential that, at any time moment t, insertion or deletion of a data record results in a limited (preferably small) number of updates of the algorithm parameters. Specifically, the number of updates per insertion/deletion must not depend on the current number of records in the dataset; otherwise, the performance of the incremental LOF algorithm would be Ω(N^2), where N is the final size of the dataset. In this section, we demonstrate efficient insertion and deletion of records in the incremental LOF algorithm and provide its exact time complexity analysis.

A. Incremental LOF algorithm

The proposed incremental LOF algorithm computes the LOF value of each data record inserted into the data set and instantly determines whether the inserted record is an outlier. In addition, the LOF values of the existing data records are updated if needed.

i. Insertion. In the insertion part, the algorithm performs two steps: a) insertion of the new record, when it computes the reach-dist, lrd and LOF values of the new point; b) maintenance, when it updates the k-distance, reach-dist, lrd and LOF values of the affected existing points. Let us first illustrate these steps through the example of inserting a new data point n into the data set shown in Fig. 3a. If we assume k = 2, we first need to compute the reachability distances to the two nearest neighbors of the new data point n, so that its lrd value can be computed. As shown further in the paper (Theorem 1), insertion of the point n may decrease the k-distance of certain neighboring points; this can happen only for points that have the new point n in their k-neighborhood. Hence, we need to determine all such affected points (see Fig. 3a). According to Eq. (1), when k-distance(p) changes for a point p, reach-dist(q,p) is affected only for points q that are in the k-neighborhood of the point p (Fig. 3b). According to Eq. (2), the lrd value of a point q is affected if: a) the k-neighborhood of the point q changes, or b) the reach-dist from the point q to one of its k-neighbors changes. The k-neighborhood of a point changes only if the new point n becomes one of its k-neighbors. Hence, we need to update lrd on all points that now have the point n as one of their k-neighbors, and on all points q for which reach-dist(q,p) is updated and p is among the k nearest neighbors of q (Fig. 3c). According to Eq. (3), the LOF value of an existing point q should be updated if lrd(q) is updated or if lrd(p) changes for one of its k-neighbors p (Fig. 3d). Note that the LOF value of a point is not updated if none of its k nearest neighbors have an updated lrd value.

The general framework of the incremental LOF method is shown in Fig. 4. As in the static LOF algorithm [9], we define the k-th nearest neighbor of a record p as a record q from the dataset S such that for at least k records o ∈ S \ {p} it holds that d(p,o) ≤ d(p,q), and for at most k-1 records o ∈ S \ {p} it holds that d(p,o) < d(p,q), where d(p,q) denotes the Euclidean distance between the data records p and q. We refer to d(p,q) as k-distance(p). The k nearest neighbors of p (referred to as kNN(p)) include all points r ∈ S \ {p} such that d(p,r) ≤ d(p,q). We also define the k reverse nearest neighbors of p (referred to as kRNN(p)) as all points q for which p is among their k nearest neighbors. For a given data record p, kNN(p) and kRNN(p) can be retrieved by executing, respectively, nearest-neighbor and reverse (a.k.a. inverse) nearest-neighbor queries on the dataset S.

The correctness of the insertion algorithm is based on the following Theorems 1-4.

Theorem 1. The insertion of a point p_c affects the k-distance only at points p_j that have the point p_c in their k-nearest neighborhood, i.e., at points p_j ∈ kRNN(p_c). The new k-distances of the affected points p_j are updated as follows:

$k\text{-}\mathrm{distance}^{(new)}(p_j) = \begin{cases} d(p_j, p_c), & \text{if } p_c \text{ is the new } k\text{-th nearest neighbor of } p_j, \\ (k-1)\text{-}\mathrm{distance}^{(old)}(p_j), & \text{otherwise.} \end{cases}$ (5)

Proof (sketch). In insertion, the k-distance of an existing point p_j changes when the new point enters the k-th nearest neighborhood of p_j, since in this case the k-neighborhood of p_j changes. If the new point p_c is the new k-th nearest neighbor of p_j, its distance from p_j becomes the new k-distance(p_j); otherwise, the old (k-1)-th nearest neighbor of p_j becomes the new k-th nearest neighbor of p_j (see Fig. 5).
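The k-distance maintenance of Theorem 1 is straightforward to express in code. The sketch below (our illustration; the neighbor list of one point p_j is kept as explicitly sorted parallel lists rather than in an index structure, and ties beyond the k-th neighbor are ignored for simplicity) applies Eq. (5) when a new point arrives at distance d_new from p_j:

import bisect

def update_k_neighborhood(nn_dists, nn_ids, k, d_new, id_new):
    """Maintain the sorted k-nearest-neighbor lists of one point p_j after
    insertion of a new point p_c at distance d_new, and return the new
    k-distance according to Eq. (5)."""
    if d_new >= nn_dists[k - 1]:
        return nn_dists[k - 1]             # p_c outside: k-distance unchanged
    pos = bisect.bisect_left(nn_dists, d_new)
    nn_dists.insert(pos, d_new)            # p_c enters the k-neighborhood
    nn_ids.insert(pos, id_new)
    del nn_dists[k:], nn_ids[k:]           # the old k-th neighbor drops out
    return nn_dists[k - 1]                 # d_new, or the old (k-1)-distance

With the neighbor list maintained this way, Corollary 1 below is immediate: the returned k-distance can only stay the same or decrease.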
Corollary 1. During insertion, the k-distance cannot increase, i.e., $k\text{-}\mathrm{distance}^{(new)}(p_j) \le k\text{-}\mathrm{distance}^{(old)}(p_j)$.

Theorem 2. A change of k-distance(p_j) may affect reach-dist_k(p_i, p_j) only for points p_i that are k-neighbors of p_j (see Fig. 3b).

Proof (sketch). By Eq. (1), for every point p_i with d(p_i, p_j) > k-distance^(old)(p_j) we have reach-dist_k^(old)(p_i, p_j) = d(p_i, p_j). According to Corollary 1, k-distance(p_j) cannot increase; hence, if d(p_i, p_j) > k-distance^(old)(p_j), then reach-dist_k^(new)(p_i, p_j) = reach-dist_k^(old)(p_i, p_j).

Theorem 3. The lrd value needs to be updated for every record (denoted p_m in Fig. 4) whose k-neighborhood changes or whose reachability distance to one of its k nearest neighbors changes. Hence, after each update of reach-dist_k(p_i, p_j) we have to update lrd(p_i) if p_j is among kNN(p_i). Also, lrd is updated for all points p_j whose k-distance was updated.

Proof (sketch). A change of the k-neighborhood of p_m affects the scope of the sum in Eq. (2), which is computed over all k-neighbors of p_m. A change of the reachability distance between p_m and one of its k nearest neighbors affects the corresponding term in the denominator of Eq. (2).

Theorem 4. The LOF value needs to be updated for all data records p_m whose lrd has been updated (since lrd(p_m) is the denominator in Eq. (3)), and for all records that have such records p_m among their k nearest neighbors. Hence, the set of data records where LOF needs to be updated corresponds to the union of the records p_m and their kRNN.

Proof (sketch). Similar to the proof of Theorem 3, using Eq. (3).

ii. Deletion. In data stream applications it is sometimes necessary to delete certain data records, e.g., due to their obsoleteness. Very often, not just a single data record but an entire data block, corresponding to particular outdated behavior, is deleted from the data set. As with insertion, upon deleting a block of data records S_delete there is a need to update the parameters of the incremental LOF algorithm. The general framework for deleting a block of data records S_delete from the dataset S is given in Fig. 6. The deletion of each record p_c in S_delete from the dataset S influences the k-distances of its kRNN: the k-neighborhood increases for each data record p_j that is in the reverse k-nearest neighborhood of p_c. For such records, k-distance(p_j) becomes equal to the distance from p_j to its new k-th nearest neighbor. The reachability distances from p_j's (k-1) nearest neighbors p_i to p_j need to be updated; observe that the reachability distance from the k-th neighbor of p_j to the record p_j is already equal to their Euclidean distance d(p_i, p_j) and does not need to be updated (Fig. 7). Analogously to insertion, the lrd value needs to be updated for all points p_j whose k-distance is updated. In addition, the lrd value needs to be updated for points p_i such that p_i is in kNN(p_j) and p_j is in kNN(p_i). Finally, the LOF value is updated for all points p_m whose lrd value is updated, as well as for their kRNN. The correctness of the deletion algorithm can be proven analogously to the correctness of the insertion algorithm.

Fig. 3. Illustration of the proposed incremental LOF algorithm (k = 2). a) Insertion of a new point n (red) results in computation of the reachability distances to its two nearest neighbors (cyan) and in an update of the 2-distance of the reverse nearest neighbors of n (yellow). b) The reachability distance reach-dist(q,p) is updated from each reverse 2-neighbor q of the point n to its 2-neighbors p (blue arrows from q to p). c) lrd is updated for all points whose 2-distance is updated and for points whose reachability distance to one of their 2-neighbors changes (green). d) LOF is updated for points whose lrd is updated and for points for which the lrd of one of their 2-neighbors is updated (pink).

IncrementalLOF_insertion(Dataset S)
Given: set S = {p_1, ..., p_N}, p_i ∈ R^D, where D is the dimensionality of the data records.
For each data point p_c in the data set S:
  insert(p_c);
  compute kNN(p_c);
  (∀p_j ∈ kNN(p_c)) compute reach-dist_k(p_c, p_j) using Eq. (1);
  // update the neighbors of p_c
  S_update_k_distance = kRNN(p_c);
  (∀p_j ∈ S_update_k_distance) update k-distance(p_j) using Eq. (5);
  S_update_lrd = S_update_k_distance;
  (∀p_j ∈ S_update_k_distance), (∀p_i ∈ kNN(p_j) \ {p_c}):
    reach-dist_k(p_i, p_j) = k-distance(p_j);
    if p_j ∈ kNN(p_i) then S_update_lrd = S_update_lrd ∪ {p_i};
  S_update_LOF = S_update_lrd;
  (∀p_m ∈ S_update_lrd):
    update lrd(p_m) using Eq. (2);
    S_update_LOF = S_update_LOF ∪ kRNN(p_m);
  (∀p_l ∈ S_update_LOF) update LOF(p_l) using Eq. (3);
  compute lrd(p_c) using Eq. (2);
  compute LOF(p_c) using Eq. (3);
End // for

Fig. 4. The general framework for insertion of a data record and computation of its LOF value in the incremental LOF algorithm.

Fig. 5. Update of the k-nearest-neighbor distance upon insertion of a new record (k = 3). a) The new record p_c is not among the 3 nearest neighbors of the record p_j, so 3-distance(p_j) does not change; b) The new record p_c is among the 3 nearest neighbors of p_j, so 3-distance(p_j) decreases. Cyan dashed lines denote the updates of the reachability distances between the point p_j and two old points.

B. Computational efficiency of the incremental LOF algorithm

To determine the time complexity of the proposed incremental LOF algorithm, it is essential to demonstrate that the number of affected data records (updates of k-distance, reachability distances, lrd and LOF values) does not depend on the current number n of records in the dataset, as stated by Theorems 5-8. Subsequently, Corollaries 3-5 provide the asymptotic time complexity of the proposed algorithm.

IncrementalLOF_deletion(Dataset S, S_delete)
S_update_k_distance = ∅;
(∀p_c ∈ S_delete):
  S_update_k_distance = S_update_k_distance ∪ kRNN(p_c);
  delete(p_c); // p_c can be deleted once all its reverse neighbors are found
S_update_k_distance = S_update_k_distance \ S_delete;
// points from S_delete may still be present while the reverse k-nearest neighbors are computed
(∀p_j ∈ S_update_k_distance) update k-distance(p_j);
S_update_lrd = S_update_k_distance;
(∀p_j ∈ S_update_k_distance), (∀p_i ∈ (k-1)-NN(p_j)):
  reach-dist_k(p_i, p_j) = k-distance(p_j);
  if p_j ∈ kNN(p_i) then S_update_lrd = S_update_lrd ∪ {p_i};
S_update_LOF = S_update_lrd;
(∀p_m ∈ S_update_lrd):
  update lrd(p_m) using Eq. (2);
  S_update_LOF = S_update_LOF ∪ kRNN(p_m);
(∀p_l ∈ S_update_LOF) update LOF(p_l) using Eq. (3);
return

Fig. 6. The framework for deletion of a block of data records in the incremental LOF method.

Fig. 7. Update of the k-nearest-neighbor distance upon deletion of the record p_c (k = 3). a) Prior to deletion, the data record p_c is among the 3 nearest neighbors of the record p_j; b) After the deletion of p_c, 3-distance(p_j) increases, and the reachability distances from the two nearest neighbors of p_j (denoted by cyan dashed lines) are updated to the new 3-distance(p_j).

Theorem 5. The maximal number F of k reverse nearest neighbors of a record p is proportional to k, exponentially proportional to the dimensionality D, and does not depend on n.

To prove Theorem 5, we first establish a few definitions and prove Lemmas 1 and 2, needed to describe the geometry of the D-dimensional space where the considered data records reside.

Definition 1. The cone C in D-dimensional space with vertex v and axis l is the locus of points x such that the Euclidean distance d(x,v) of the point x to the vertex is proportional to the distance d(x',v) of the point's projection x' onto l to the vertex v. A line containing a point x on the cone and the vertex is called a generatrix. When the vertex is at the origin of the coordinate system (v = O), we refer to the cone as a centered cone.

Definition 2. The half-axis angle of a cone is the angle α/2 = ∠xvx' between the axis and any generatrix.

Lemma 1. The coordinates X_1, X_2, ..., X_D of a point x on the centered cone C whose axis l is parallel to the x_1 axis of the coordinate system satisfy

$X_2^2 + X_3^2 + \cdots + X_D^2 = a X_1^2, \quad a > 0,$ (6)

where a is a pre-specified parameter. The half-axis angle of C satisfies

$\cos\frac{\alpha}{2} = \frac{X_1}{\sqrt{\sum_{i=1}^{D} X_i^2}} = \frac{1}{\sqrt{1+a}}.$ (7)

Proof. Follows directly from Definitions 1 and 2 and the fact that the length of the projection of x onto l is X_1 when l is parallel to x_1.

Definition 3. All points whose coordinates satisfy X_2^2 + X_3^2 + ... + X_D^2 < a X_1^2 are inside the cone C and comprise the set inside(C).

Definition 4. The ball B(c,R) in D-dimensional space is the locus of points x such that the distance of the point x from a prespecified point c is smaller than or equal to R > 0.

Lemma 2. Consider a cone C with vertex p and half-axis angle α/2 ≤ 30°, the ball B(p,R), and the volume V = B ∩ C. Then for every p' ∈ C with d(p,p') = R, it holds that V ⊆ B'(p',R). (See Fig. 8 for a three-dimensional illustration.)

Fig. 8. Illustration of Lemma 2 in three-dimensional space.

Proof (sketch). Without loss of generality, we may assume that the cone is centered, i.e., that p is at the origin. Let the coordinates of the point p' be X_1, X_2, ..., X_D, and consider the point p'' symmetric to p' with respect to the x_1 axis, with coordinates X_1, -X_2, ..., -X_D. Clearly, p'' ∈ C and d(p,p'') = R. The distance between these two points is d(p',p'') = 2·sqrt(X_2^2 + ... + X_D^2) = 2·sqrt(a)·X_1. Since the points p' and p'' lie on the sphere of radius R, from Eq. (7) we obtain d(p',p'') = 2R·sin(α/2) ≤ 2R·sin 30° = R. Therefore, the ball B'(p',R) contains the point p'' antipodal to p'. It can be shown that B' also contains any other boundary point of V.
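For completeness, the trigonometric step behind the bound in Lemma 2 can be spelled out explicitly; this is our reconstruction from Eqs. (6) and (7):

$X_1 = R\cos\frac{\alpha}{2}, \qquad \sqrt{X_2^2+\cdots+X_D^2} = \sqrt{a}\,X_1 = R\sin\frac{\alpha}{2},$

$d(p',p'') = 2\sqrt{X_2^2+\cdots+X_D^2} = 2R\sin\frac{\alpha}{2} \le 2R\sin 30^\circ = R.$

The 60-degree opening angle is thus exactly what makes the antipodal point p'' (and, by the same argument, every boundary point of V) fall within distance R of p'.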
Definition 5. A frame of cones C_i, i = 1, ..., h, is defined as a set of h cones with common vertex p and common angle α.

The frame of cones completely covers the D-dimensional Euclidean space E^D if $\bigcup_{i=1}^{h} C_i = E^D$ (see Fig. 9).

Lemma 3. The lower bound on the number of α = 60° cones in a D-dimensional frame is

$h_{\min} = \frac{\int_0^{\pi} \sin^{D-2}\varphi \, d\varphi}{\int_0^{\pi/6} \sin^{D-2}\varphi \, d\varphi} = \Theta(2^D).$

Proof (sketch). The lower bound on the number of cones in a frame equals the ratio of the area of the hypersphere to the area of the hyperspherical cap (the part of the hypersphere inside a cone with angle α). Note that h_min depends only on the geometry of D-dimensional space and is independent of the number and placement of the D-dimensional data records.

The following definition and corollary link the geometry of D-dimensional cones to the proximity notion in D-dimensional datasets.

Definition 6. Let S' be the set of the D-dimensional data records inside a cone C with vertex p. The k-nearest neighbor of the point p in the cone C is a record q from the dataset S' such that for at least k records o ∈ S' \ {p} it holds that d(p,o) ≤ d(p,q), and for at most k-1 records o ∈ S' \ {p} it holds that d(p,o) < d(p,q). We also define k-distance_C(p) as the distance from the record p to its k-th nearest neighbor q in the cone. The k-nearest neighborhood of p in the cone C (referred to as kNN_C(p)) includes all records r ∈ S' \ {p} such that d(p,r) ≤ d(p,q).

Corollary 2. Consider a cone C centered at a data point p with α = 60°, and let p' be a data point in the cone C. Then (∀p' ∈ inside(C)): p' ∉ kNN_C(p) ⟹ p ∉ kNN_C(p').

Proof (sketch). Consider the ball B(p, k-distance_C(p)). By Definition 6, the volume V = B ∩ C contains exactly k points other than p. Consider a data point p' ∈ C such that d(p,p') > k-distance_C(p), and the ball B'(p', d(p,p')). According to Lemma 2, this ball contains as a subset the whole volume V, which together with p comprises k+1 data points. Hence, k-distance_C(p') < d(p',p), and therefore p ∉ kNN_C(p').

Proof of Theorem 5. Due to Corollary 2 and Definitions 5 and 6, $kRNN(p) \subseteq \bigcup_{i=1}^{h} kNN_{C_i}(p)$, where C_i, i = 1, ..., h, is a frame of cones with vertex p. Hence, due to Lemma 3 and Definition 6, $|kRNN(p)| \le F = k \cdot h_{\min} = \Theta(k \cdot 2^D)$. Since neither h_min nor k depends on n, the number of reverse nearest neighbors does not depend on the current number of points in the dataset.

(Footnote: Suboptimal values of h can be obtained by techniques that construct spherical codes, followed by covering verification (to ensure that the space around p is completely covered), e.g., based on examination of convex-hull facets. Analogous to the problem of optimal spherical codes, the problem of finding the smallest possible h for arbitrary D is unresolved and is related, but not equivalent, to the sphere covering problem. Using the aforementioned suboptimal construction, upper bounds on h have been demonstrated for several low dimensions.)
Fig. 9. a) Two 60-degree cones; b) Spherical caps of a frame of cones that completely cover the space.

The following theorems provide upper bounds on the number of points where k-distance, lrd and LOF are updated.

Theorem 6. The number of data records where the k-distance needs to be updated is |S_update_k_distance| ≤ F for insertion, and |S_update_k_distance| ≤ F·|S_delete| for deletion of a block of size |S_delete|.

Proof (sketch). Upon insertion/deletion of a single data record s, the k-distance needs to be updated on all data records that have the inserted/deleted record in their neighborhood, i.e., on all reverse nearest neighbors of s. Theorem 5 bounds the number of reverse nearest neighbors by F.

Theorem 7. The number of data records where lrd is updated is |S_update_lrd| ≤ (1 + k)·|S_update_k_distance|.

Proof (sketch). lrd values are updated on the points from S_update_k_distance and may additionally be updated on their k nearest neighbors.

Theorem 8. The number of data records where LOF is updated is |S_update_LOF| ≤ (1 + F)·|S_update_lrd|.

Proof (sketch). The LOF value is updated on the data records from S_update_lrd and on their reverse nearest neighbors, which gives the stated bound.

The following corollaries provide the asymptotic time complexity of the proposed algorithm.

Corollary 3. The asymptotic time complexities of insertion and deletion in the incremental LOF are

$T_{incrLOF\_ins} = O(k F \cdot T_{kNN} + k F \cdot T_{kRNN} + F k + T_{insert}),$ (8)
$T_{incrLOF\_del} = O(|S_{delete}| \cdot (k F \cdot T_{kNN} + k F \cdot T_{kRNN} + F k + T_{delete})).$

Here, T_kNN and T_kRNN are the time complexities of the kNN and kRNN algorithms, respectively, while T_insert and T_delete correspond to the time needed for insertion and deletion of a data record into/from the database (including index updating).

Proof (sketch). Follows from the insertion and deletion algorithms of the incremental LOF (Figs. 4 and 6, respectively) and Theorems 6-8.

By maintaining a list kRNN(p) for each record p, the time complexities can be further reduced to $T_{incrLOF\_ins} = O(k F \cdot T_{kNN} + T_{kRNN} + F k + T_{insert})$ and $T_{incrLOF\_del} = O(|S_{delete}| \cdot (k F \cdot T_{kNN} + F k + T_{delete}))$.
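Corollary 4 below assumes logarithmic-time kNN and kRNN queries. As a rough illustration of the kind of index such an implementation rests on, the following sketch uses SciPy's cKDTree for kNN queries; the reverse-kNN routine here is a naive linear-time placeholder for the specialized RkNN structures cited in the paper:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
tree = cKDTree(X)                          # O(n log n) build, fast queries

def knn(q, k):
    """Indices and distances of the k nearest neighbors of a query point."""
    dist, idx = tree.query(q, k=k)
    return idx, dist

def krnn_naive(q, k):
    """Reverse k-NN by brute force: all points whose k-distance admits q.
    k+1 is queried because each point returns itself at distance 0."""
    kdist = tree.query(X, k=k + 1)[0][:, -1]
    return np.where(np.linalg.norm(X - q, axis=1) <= kdist)[0]

idx, dist = knn(np.zeros(3), k=5)
print(idx, krnn_naive(np.zeros(3), k=5))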

Corollary 4. When efficient algorithms for kNN and kRNN queries, as well as efficient indexing structures for inserting and deleting records, are used, so that T_kNN = T_kRNN = T_insert = T_delete = O(log n), the time complexities T_incrLOF_ins and T_incrLOF_del are logarithmic in the current size n of the database, e.g.,

$T_{incrLOF\_ins} = O(k F \log n + F k).$ (9)

Corollary 5. The time complexity of the incremental LOF algorithm, after all updates to a dataset of size N are applied, is O(N log N).

Proof. Follows directly from Corollary 4.

Note that, according to Theorem 5, the time complexity of the incremental LOF may increase exponentially with the dimensionality D. However, this is a well-known problem of the static LOF [9], as well as of other density-based algorithms, and is not a particular issue of the incremental LOF.

IV. EXPERIMENTAL RESULTS

Our experiments were performed on several synthetic and real-life data sets. In all experiments we assumed that information about the outliers in the data set is available, so that detection performance can be evaluated. In the following subsections we evaluate time complexity (subsection A) and outlier detection accuracy (subsections B and C) with respect to the ground-truth outlier information.

A. Time Complexity Analysis

The time complexity analysis was performed on synthetic data sets, since there we could fully control the total number of data records N as well as the number of dimensions D. The reported experimental results provide evidence about (i) the relation between the number of updates of LOF values and the total number of data points N; (ii) the dependence of the number of updates of LOF values on the LOF parameter k; and (iii) the dependence of the number of updates of LOF values on the dimension D. The synthetic data sets combined four different numbers of data records N with five different numbers of dimensions D. For each pair (D, N), we created data sets with N random records generated from a D-variate distribution. We experimented with both the uniform and the standard (zero mean, unit covariance matrix) Gaussian distribution. For each of the data sets generated for a pair (D, N), we varied the value of the parameter k of the algorithm and measured the number of updates of k-distance, reach-dist, lrd and LOF values in the incremental LOF algorithm upon insertion of a new data record. Here, for each pair (D, N), we report the average number of LOF updates over the data sets generated using the standard Gaussian distribution; the results obtained for the data sets generated using the uniform distribution are analogous and are not reported here due to lack of space. Fig. 10 shows how the number of updates of LOF values depends on the total number of data records N (x-axis) for different numbers of dimensions D (different lines in the graphs), where each graph corresponds to a distinct value of the parameter k.

Fig. 10. The dependence of the number of LOF updates on the database size N for different numbers of dimensions D and different values of the parameter k. The results are averaged over the data sets generated from the standardized Gaussian distribution.

Analyzing Fig. 10, it can be observed that the number of updates of LOF values stabilizes for sufficiently large N, which is in accordance with our theoretical analysis from Section III.B showing O(1) updates with respect to N.
It is interesting to note that for larger k, the number of data records necessary for the number of LOF updates to stabilize is generally larger. However, for typically used values of k, the number of LOF updates becomes constant once N is sufficiently large. Fig. 11 shows the average number of LOF updates vs. the parameter k (each curve corresponds to a particular value of D) on a database of fixed size N. The left graph uses a linear abscissa, while the abscissa of the right graph is quadratic (proportional to k^2). Fig. 11 indicates that the actual number of updates grows no faster than k^2. Therefore, the worst-case upper bound O(F·k) = O(k^2·2^D) obtained in Section III.B appears rather pessimistic. This phenomenon is due to the fact that, in practice, not every update of a reach-dist value results in an update of an lrd value. Also, a data record may belong to the reverse nearest neighbors of multiple records whose lrd has been updated; such a data record, although output by several kRNN queries, results in only one update of its LOF value. Fig. 11 also provides insight into the dependence of the number of LOF updates on the number of dimensions D. While the number of LOF updates undoubtedly increases with D, it was difficult to confirm (or reject) the theoretical upper bound of exponential dependence (see Section III.B). However, it is evident that the growth of the number of LOF updates with respect to the dimensionality D is not explosive: the average number of updates stays modest even for the largest D and k considered. One reason is that the stated upper bound on the number of reverse nearest neighbors is a worst case that is reached rather infrequently. Hence, we anticipate that data dimensionality will become a bottleneck of the incremental LOF algorithm not through the number of LOF updates, but rather through the inability of indexing structures to withstand the curse of dimensionality.
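The flavor of this measurement is easy to reproduce. The sketch below (our illustration, using brute-force neighbor computations) inserts standard-Gaussian points one at a time and counts how many existing points acquire the new point as a k-nearest neighbor, i.e., the size of the kRNN set bounded by Theorem 5:

import numpy as np

def affected_counts(D, N, k, seed=0):
    """Insert N standard-Gaussian points one by one; after each insertion
    count the existing points whose k-neighborhood the new point enters."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, D))
    counts = []
    for n in range(k + 1, N):
        old, new = X[:n], X[n]
        d_new = np.linalg.norm(old - new, axis=1)
        dists = np.linalg.norm(old[:, None, :] - old[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        kdist = np.sort(dists, axis=1)[:, k - 1]   # current k-distances
        counts.append(int(np.sum(d_new < kdist)))  # |kRNN| of the new point
    return counts

print(np.mean(affected_counts(D=2, N=300, k=5)))   # roughly flat in N

Consistent with the theory, the average count settles to a value that depends on k and D but not on the number of points already inserted.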

Fig. 11. The dependence of the number of LOF updates on the value of the parameter k for different numbers of dimensions D. The results are averaged over data sets with records generated from the standardized normal distribution. a) Linear abscissa; b) Abscissa proportional to k^2.

B. Learning New Behavior

The first synthetic data set, used to analyze the ability of the incremental LOF algorithm to learn new behavior in non-stationary data, contained records generated from a two-component mixture of multivariate Gaussian distributions with different means: records generated from a Gaussian distribution N(μ1, Σ) were followed by records generated from a Gaussian distribution N(μ2, Σ) with a shifted mean and the same covariance matrix. Fig. 12 shows the current LOF values at four successive snapshots during insertion. The first inserted record from the new distribution was identified as an outlier as soon as it appeared. However, once the switch to the new distribution was complete, the incremental LOF algorithm had learned the new distribution as part of the regular behavior, so the subsequent new records were correctly labeled as normal. As discussed in Section II, if the supervised static LOF (trained on the records of the first distribution) were used instead, the data records from the new distribution would always be marked as outliers, since they did not appear in the original data on which the supervised static LOF was trained (see Fig. 1). After all records were inserted, the resulting LOF values were identical to those obtained using the periodic static LOF algorithm; the periodic LOF algorithm, however, identifies all records belonging to the new distribution as normal, although some of them were outliers at the time of their insertion.

The second synthetic data set consisted of data records belonging to a two-component Gaussian mixture with the same mean but different variances. This data set was created to illustrate an attempt at masquerading (hiding within an existing distribution; see Fig. 2): records from a Gaussian distribution N(μ, Σ1) were followed by records from a Gaussian distribution N(μ, Σ2) with a much smaller covariance and hence much higher density. Fig. 13 shows the current LOF values obtained using the incremental LOF algorithm at four successive snapshots during insertion. Initially, the computed LOF values do not reveal any change of behavior. However, as additional data records continue to appear within the new distribution, their lrd values start to increase, reflecting the larger local density of the new distribution. According to Eq. (3), this causes an increase of the LOF values of points from the old distribution that lie on the border of the new distribution, and the maximal LOF value eventually grows markedly. Since the incremental LOF algorithm keeps track of updates of the LOF values of the existing data records (those already in the database), this phenomenon is easy to identify. Therefore, the incremental LOF algorithm is capable of identifying the masquerading attempt, as well as the approximate time when the attempt begins! As already discussed in Section II, the supervised static LOF (trained on the records of the first distribution) is not capable of identifying this behavior, since the new data records, corresponding to the new, much denser distribution, are not considered when computing the lrd values. In contrast, the periodic LOF method is able to identify the masquerading, but this identification is delayed by up to the length of the LOF update period.
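The structure of these two synthetic experiments is simple to emulate. The following sketch (our illustration; the means, scales and block sizes are placeholder values, not the paper's exact settings) builds both kinds of streams, a mean shift and a masquerading density change:

import numpy as np

def two_phase_stream(n1, n2, mu1, mu2, s1, s2, seed=0):
    """A stream with an abrupt distribution change: n1 points from
    N(mu1, s1^2 I) followed by n2 points from N(mu2, s2^2 I)."""
    rng = np.random.default_rng(seed)
    a = mu1 + s1 * rng.normal(size=(n1, len(mu1)))
    b = mu2 + s2 * rng.normal(size=(n2, len(mu1)))
    return np.vstack([a, b])

mu = np.zeros(2)
shift_stream = two_phase_stream(500, 500, mu, mu + 4.0, 1.0, 1.0)  # new mode
masking_stream = two_phase_stream(500, 500, mu, mu, 1.0, 0.05)     # denser mode

Feeding either stream point by point into the insertion procedure of Fig. 4 and inspecting the LOF values at insertion time reproduces the qualitative behavior described above.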
Fig. 12. The result of the incremental LOF algorithm on the data set consisting of records from a Normal distribution N(μ1, Σ) followed by records from a Normal distribution N(μ2, Σ) with a shifted mean, shown at four successive snapshots a)-d) during insertion. The data records inserted at the current time instant are marked red.

C. Experiments on real-life data sets

To illustrate the ability of the incremental LOF algorithm to identify outliers in a dynamic environment, we first selected two real-life data sets containing video sequences. The first data set is composed of video frames, with features extracted from the video frames using a previously described procedure. Our goal was to identify sudden changes in the selected video frames, which is an important problem in video analysis, especially in the analysis of streaming videos (e.g., video surveillance). Analyzing Fig. 14, it can be observed that the proposed incremental LOF correctly detected all the sudden changes in the video frames while not producing any false alarms. These changes were caused by the appearance of a new object, by objects zooming toward the camera, and by novel video content. On the other hand, the static periodic LOF algorithm computed after all frames did not detect any of these frames as outliers, while the supervised LOF algorithm had a very large false alarm rate due to the non-stationarity of the data.

Fig. 13. The result of the incremental LOF algorithm on the data set consisting of records from a Gaussian distribution N(μ, Σ1) followed by records from a Gaussian distribution N(μ, Σ2) with the same mean and a much smaller covariance, shown at four successive snapshots a)-d) during insertion.

Finding unusual trajectories in surveillance videos may lead to early identification of illegal behavior, and hence the outliers need to be detected as soon as they appear. In such applications, due to data non-stationarity, the static LOF algorithm is not suitable. To identify outliers in this environment, we performed experiments on a second real-life data set. The data set corresponds to video motion trajectories extracted from IR surveillance videos using our motion detection and tracking algorithms, among which only two trajectories (a person walking right and then back left, and a person walking very slowly) were visually identified as unusual behavior. Each trajectory was represented by five equidistant points in [x, y, time] space (two spatial coordinates on the frame and the time instant), and the dimensionality of this feature vector was further reduced to three using principal component analysis. The data set is publicly available. Results of outlier detection on the IR video trajectories are shown in Fig. 15. Fig. 15a shows the values LOF(t) computed using the incremental LOF algorithm for each trajectory t at the moment of its insertion into the database. As can be observed, the LOF values of the two unusual trajectories are significantly larger than those of the remaining trajectories, clearly indicating unusual behavior. To further illustrate the performance of the incremental LOF method, we plotted all considered trajectories in [x, y, time] space (Fig. 15b), color-coded by the computed LOF value. It can be seen that the most intense color (and the highest LOF) corresponds to the clear outliers.

Fig. 14. Result of applying the incremental (blue asterisks) and static periodic (red x) LOF algorithms on video sequences from our test movie. For the static LOF algorithm, the LOF values shown for each frame t are computed when the latest data record is inserted.
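The trajectory representation used above is also easy to reproduce. A sketch (our illustration; the tracker output format is an assumption) that turns five equidistant [x, y, time] samples per trajectory into three PCA features:

import numpy as np

def trajectory_features(trajectories, n_components=3):
    """trajectories: array of shape (n_traj, 5, 3) holding five equidistant
    [x, y, time] samples per trajectory. Flatten to 15-dimensional vectors
    and project onto the top principal components (PCA via SVD)."""
    X = trajectories.reshape(len(trajectories), -1)    # (n_traj, 15)
    Xc = X - X.mean(axis=0)                            # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                    # (n_traj, 3)

rng = np.random.default_rng(0)
print(trajectory_features(rng.normal(size=(40, 5, 3))).shape)   # (40, 3)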
The third data set was the 1998 DARPA Intrusion Detection Evaluation Data. The original DARPA 98 data contain two parts: training data and test data. The training data consist of seven weeks of network-based attacks inserted into normal background data; attacks in the training data are labeled. The test data contain two weeks of network-based attacks and normal background data. Seven weeks of data resulted in about five million connection records. Since the amount of available data is huge (some days have several million connection records), we chose only the data from the fifth week to present our findings. We used the tcptrace utility software as the packet filtering tool in order to extract information about packets from TCP connections and to construct new features. The DARPA 98 training data include list files that identify the time stamps (start time and duration), service type, source IP address and source port, destination IP address and destination port, as well as the type of each attack. We used this information to map the connection records from the list files to the connections obtained using the tcptrace utility software and to correctly label each connection record as normal or as a particular attack type. The same technique was used to construct the KDDCup 99 data set, but that data set did not keep time information about the attacks. The main reason for this procedure is to associate the newly constructed features with the connection records from the list files and to create a more informative data set for learning. The complete list of features extracted from the raw tcpdump data using the tcptrace software is available in our previously published results.

The results of applying the incremental LOF algorithm to the DARPA 98 data set are shown in Fig. 16. We have chosen to illustrate only two phenomena here. The first phenomenon corresponds to a time moment when new behavior starts to appear. This behavior was clearly detected by the incremental LOF algorithm (Fig. 16, dash-dot cyan line). Although the detected behavior did not correspond to any intrusion, it was important to include it when updating the characteristics of normal behavior.

The second phenomenon, at time moment t = 97, corresponds to the detection of a Denial of Service (DoS) attack (Fig. 16, red dashed line). The LOF value at this time moment spikes very high, resulting in immediate DoS detection. If the static periodic LOF algorithm were applied after all data records had been inserted into the data set, this DoS attack would not be detected (Fig. 16, black dash-dot line), since the attack records would create a new data distribution (the scenario of Fig. 1 discussed in Section II).

Fig. 15. The incremental LOF algorithm used for identifying unusual trajectories in motion videos: a) the LOF value obtained for each trajectory; b) the trajectories in [x, y, time] feature space, colored according to the LOF value.

V. CONCLUSIONS AND FUTURE WORK

A framework for an incremental density-based outlier detection scheme has been presented. The proposed algorithm has the same detection performance as the static iterated LOF algorithm (applied after insertion of each data record) but is much more computationally efficient. Experimental results on several synthetic and real-life data sets from the video and intrusion detection domains indicate that the proposed incremental LOF algorithm provides additional functionality that is difficult to achieve with static variants of the LOF algorithm, including detection of new behavior as well as identification of masquerading outliers.

Fig. 16. Instantaneous LOF values for all the data records in the dataset at two time instants, obtained using the incremental LOF algorithm on the network intrusion dataset.

The fact that the number of updates in the incremental LOF algorithm per insertion/deletion of a single data record does not depend on the total number of data records is quite appealing for its use in real-time data stream applications. However, its performance crucially depends on efficient indexing structures to support k-nearest neighbor and reverse k-nearest neighbor queries. Due to the limitations of existing indexing structures for high-dimensional data, the proposed incremental LOF (like the static LOF) is not applicable when the data have a very large number of dimensions. Approximate kNN and reverse-kNN algorithms might improve the applicability of incremental LOF to multidimensional data. Future work on deleting data records from the database is needed. More specifically, it would be interesting to design an algorithm with exponential decay of weights, where the most recent data records have the highest influence on the local density estimation. In addition, an extension of the proposed methodology to create incremental versions of other emerging outlier detection algorithms, e.g., the Connectivity-based Outlier Factor (COF) or LOCI, is also worth considering. Additional real-life data sets will be used to evaluate the proposed algorithm, and ROC curves will be applied to quantify algorithm performance.

ACKNOWLEDGMENTS

D. Pokrajac has been partially supported by NIH, DoD/DoA and NSF grants. The authors would also like to thank Roland Miezianko, Terravic Corp., for providing the IR surveillance video; Brian Tjaden, Wellesley College, for discussion about reverse k-NN queries; and David Mount, University of Maryland, for insightful discussion about sphere covering verification.

REFERENCES
[1] M. Joshi, R. Agarwal, V. Kumar, PNrule, Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction, In Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA, May 2001.
[2] N. Chawla, A. Lazarevic, L. Hall, K. Bowyer, SMOTEBoost: Improving the Prediction of the Minority Class in Boosting, In Proceedings of the


More information

Skill Sets Chapter 5 Functions

Skill Sets Chapter 5 Functions Skill Sets Chapter 5 Functions No. Skills Examples o questions involving the skills. Sketch the graph o the (Lecture Notes Example (b)) unction according to the g : x x x, domain. x, x - Students tend

More information

Chapter 3 Image Enhancement in the Spatial Domain

Chapter 3 Image Enhancement in the Spatial Domain Chapter 3 Image Enhancement in the Spatial Domain Yinghua He School o Computer Science and Technology Tianjin University Image enhancement approaches Spatial domain image plane itsel Spatial domain methods

More information

Scheduling in Multihop Wireless Networks without Back-pressure

Scheduling in Multihop Wireless Networks without Back-pressure Forty-Eighth Annual Allerton Conerence Allerton House, UIUC, Illinois, USA September 29 - October 1, 2010 Scheduling in Multihop Wireless Networks without Back-pressure Shihuan Liu, Eylem Ekici, and Lei

More information

Estimating the Free Region of a Sensor Node

Estimating the Free Region of a Sensor Node Estimating the Free Region of a Sensor Node Laxmi Gewali, Navin Rongratana, Jan B. Pedersen School of Computer Science, University of Nevada 4505 Maryland Parkway Las Vegas, NV, 89154, USA Abstract We

More information

Mobile Robot Static Path Planning Based on Genetic Simulated Annealing Algorithm

Mobile Robot Static Path Planning Based on Genetic Simulated Annealing Algorithm Mobile Robot Static Path Planning Based on Genetic Simulated Annealing Algorithm Wang Yan-ping 1, Wubing 2 1. School o Electric and Electronic Engineering, Shandong University o Technology, Zibo 255049,

More information

Small Moving Targets Detection Using Outlier Detection Algorithms

Small Moving Targets Detection Using Outlier Detection Algorithms Please verify that (1 all pages are present, (2 all figures are acceptable, (3 all fonts and special characters are correct, and (4 all text and figures fit within the Small Moving Targets Detection Using

More information

Anomaly Detection on Data Streams with High Dimensional Data Environment

Anomaly Detection on Data Streams with High Dimensional Data Environment Anomaly Detection on Data Streams with High Dimensional Data Environment Mr. D. Gokul Prasath 1, Dr. R. Sivaraj, M.E, Ph.D., 2 Department of CSE, Velalar College of Engineering & Technology, Erode 1 Assistant

More information

Computer Data Analysis and Use of Excel

Computer Data Analysis and Use of Excel Computer Data Analysis and Use o Excel I. Theory In this lab we will use Microsot EXCEL to do our calculations and error analysis. This program was written primarily or use by the business community, so

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Piecewise polynomial interpolation

Piecewise polynomial interpolation Chapter 2 Piecewise polynomial interpolation In ection.6., and in Lab, we learned that it is not a good idea to interpolate unctions by a highorder polynomials at equally spaced points. However, it transpires

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Automated Modelization of Dynamic Systems

Automated Modelization of Dynamic Systems Automated Modelization o Dynamic Systems Ivan Perl, Aleksandr Penskoi ITMO University Saint-Petersburg, Russia ivan.perl, aleksandr.penskoi@corp.imo.ru Abstract Nowadays, dierent kinds o modelling settled

More information

Compressed Sensing Image Reconstruction Based on Discrete Shearlet Transform

Compressed Sensing Image Reconstruction Based on Discrete Shearlet Transform Sensors & Transducers 04 by IFSA Publishing, S. L. http://www.sensorsportal.com Compressed Sensing Image Reconstruction Based on Discrete Shearlet Transorm Shanshan Peng School o Inormation Science and

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

Joint Congestion Control and Scheduling in Wireless Networks with Network Coding

Joint Congestion Control and Scheduling in Wireless Networks with Network Coding Title Joint Congestion Control and Scheduling in Wireless Networks with Network Coding Authors Hou, R; Wong Lui, KS; Li, J Citation IEEE Transactions on Vehicular Technology, 2014, v. 63 n. 7, p. 3304-3317

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Classification Method for Colored Natural Textures Using Gabor Filtering

Classification Method for Colored Natural Textures Using Gabor Filtering Classiication Method or Colored Natural Textures Using Gabor Filtering Leena Lepistö 1, Iivari Kunttu 1, Jorma Autio 2, and Ari Visa 1, 1 Tampere University o Technology Institute o Signal Processing P.

More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Monocular 3D human pose estimation with a semi-supervised graph-based method

Monocular 3D human pose estimation with a semi-supervised graph-based method Monocular 3D human pose estimation with a semi-supervised graph-based method Mahdieh Abbasi Computer Vision and Systems Laboratory Department o Electrical Engineering and Computer Engineering Université

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017 CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

More information

Global Constraints. Combinatorial Problem Solving (CPS) Enric Rodríguez-Carbonell (based on materials by Javier Larrosa) February 22, 2019

Global Constraints. Combinatorial Problem Solving (CPS) Enric Rodríguez-Carbonell (based on materials by Javier Larrosa) February 22, 2019 Global Constraints Combinatorial Problem Solving (CPS) Enric Rodríguez-Carbonell (based on materials by Javier Larrosa) February 22, 2019 Global Constraints Global constraints are classes o constraints

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Neighbourhood Operations

Neighbourhood Operations Neighbourhood Operations Neighbourhood operations simply operate on a larger neighbourhood o piels than point operations Origin Neighbourhoods are mostly a rectangle around a central piel Any size rectangle

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Querying Complex Spatio-Temporal Sequences in Human Motion Databases

Querying Complex Spatio-Temporal Sequences in Human Motion Databases Querying Complex Spatio-Temporal Sequences in Human Motion Databases Yueguo Chen #, Shouxu Jiang, Beng Chin Ooi #, Anthony KH Tung # # School o Computing, National University o Singapore Computing, Law

More information

Lagrangian relaxations for multiple network alignment

Lagrangian relaxations for multiple network alignment Noname manuscript No. (will be inserted by the editor) Lagrangian relaxations or multiple network alignment Eric Malmi Sanjay Chawla Aristides Gionis Received: date / Accepted: date Abstract We propose

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Convex hulls of spheres and convex hulls of convex polytopes lying on parallel hyperplanes

Convex hulls of spheres and convex hulls of convex polytopes lying on parallel hyperplanes Convex hulls of spheres and convex hulls of convex polytopes lying on parallel hyperplanes Menelaos I. Karavelas joint work with Eleni Tzanaki University of Crete & FO.R.T.H. OrbiCG/ Workshop on Computational

More information

Some Inequalities Involving Fuzzy Complex Numbers

Some Inequalities Involving Fuzzy Complex Numbers Theory Applications o Mathematics & Computer Science 4 1 014 106 113 Some Inequalities Involving Fuzzy Complex Numbers Sanjib Kumar Datta a,, Tanmay Biswas b, Samten Tamang a a Department o Mathematics,

More information

Chapter 9: Outlier Analysis

Chapter 9: Outlier Analysis Chapter 9: Outlier Analysis Jilles Vreeken 8 Dec 2015 IRDM Chapter 9, overview 1. Basics & Motivation 2. Extreme Value Analysis 3. Probabilistic Methods 4. Cluster-based Methods 5. Distance-based Methods

More information

Classifier Evasion: Models and Open Problems

Classifier Evasion: Models and Open Problems . In Privacy and Security Issues in Data Mining and Machine Learning, eds. C. Dimitrakakis, et al. Springer, July 2011, pp. 92-98. Classiier Evasion: Models and Open Problems Blaine Nelson 1, Benjamin

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

Using a Projected Subgradient Method to Solve a Constrained Optimization Problem for Separating an Arbitrary Set of Points into Uniform Segments

Using a Projected Subgradient Method to Solve a Constrained Optimization Problem for Separating an Arbitrary Set of Points into Uniform Segments Using a Projected Subgradient Method to Solve a Constrained Optimization Problem or Separating an Arbitrary Set o Points into Uniorm Segments Michael Johnson May 31, 2011 1 Background Inormation The Airborne

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2016 A second course in data mining!! http://www.it.uu.se/edu/course/homepage/infoutv2/vt16 Kjell Orsborn! Uppsala Database Laboratory! Department of Information Technology,

More information

Identifying Representative Traffic Management Initiatives

Identifying Representative Traffic Management Initiatives Identiying Representative Traic Management Initiatives Alexander Estes Applied Mathematics & Statistics, and Scientiic Computation University o Maryland-College Park College Park, MD, United States o America

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Learning Implied Global Constraints

Learning Implied Global Constraints Learning Implied Global Constraints Christian Bessiere LIRMM-CNRS U. Montpellier, France bessiere@lirmm.r Remi Coletta LRD Montpellier, France coletta@l-rd.r Thierry Petit LINA-CNRS École des Mines de

More information

Reducing the Bandwidth of a Sparse Matrix with Tabu Search

Reducing the Bandwidth of a Sparse Matrix with Tabu Search Reducing the Bandwidth o a Sparse Matrix with Tabu Search Raael Martí a, Manuel Laguna b, Fred Glover b and Vicente Campos a a b Dpto. de Estadística e Investigación Operativa, Facultad de Matemáticas,

More information

Pebble Sets in Convex Polygons

Pebble Sets in Convex Polygons 2 1 Pebble Sets in Convex Polygons Kevin Iga, Randall Maddox June 15, 2005 Abstract Lukács and András posed the problem of showing the existence of a set of n 2 points in the interior of a convex n-gon

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

Reflection and Refraction

Reflection and Refraction Relection and Reraction Object To determine ocal lengths o lenses and mirrors and to determine the index o reraction o glass. Apparatus Lenses, optical bench, mirrors, light source, screen, plastic or

More information

A Cylindrical Surface Model to Rectify the Bound Document Image

A Cylindrical Surface Model to Rectify the Bound Document Image A Cylindrical Surace Model to Rectiy the Bound Document Image Huaigu Cao, Xiaoqing Ding, Changsong Liu Department o Electronic Engineering, Tsinghua University State Key Laboratory o Intelligent Technology

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Generell Topologi. Richard Williamson. May 6, 2013

Generell Topologi. Richard Williamson. May 6, 2013 Generell Topologi Richard Williamson May 6, Thursday 7th January. Basis o a topological space generating a topology with a speciied basis standard topology on R examples Deinition.. Let (, O) be a topological

More information

Scalable Test Problems for Evolutionary Multi-Objective Optimization

Scalable Test Problems for Evolutionary Multi-Objective Optimization Scalable Test Problems or Evolutionary Multi-Objective Optimization Kalyanmoy Deb Kanpur Genetic Algorithms Laboratory Indian Institute o Technology Kanpur PIN 8 6, India deb@iitk.ac.in Lothar Thiele,

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1 Asymptotics, Recurrence and Basic Algorithms 1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1 1. O(logn) 2. O(n) 3. O(nlogn) 4. O(n 2 ) 5. O(2 n ) 2. [1 pt] What is the solution

More information

Research on Image Splicing Based on Weighted POISSON Fusion

Research on Image Splicing Based on Weighted POISSON Fusion Research on Image Splicing Based on Weighted POISSO Fusion Dan Li, Ling Yuan*, Song Hu, Zeqi Wang School o Computer Science & Technology HuaZhong University o Science & Technology Wuhan, 430074, China

More information

Automatic Video Segmentation for Czech TV Broadcast Transcription

Automatic Video Segmentation for Czech TV Broadcast Transcription Automatic Video Segmentation or Czech TV Broadcast Transcription Jose Chaloupka Laboratory o Computer Speech Processing, Institute o Inormation Technology and Electronics Technical University o Liberec

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Digital Image Processing. Image Enhancement in the Spatial Domain (Chapter 4)

Digital Image Processing. Image Enhancement in the Spatial Domain (Chapter 4) Digital Image Processing Image Enhancement in the Spatial Domain (Chapter 4) Objective The principal objective o enhancement is to process an images so that the result is more suitable than the original

More information

DETECTION OF CHANGES IN SURVEILLANCE VIDEOS. Longin Jan Latecki, Xiangdong Wen, and Nilesh Ghubade

DETECTION OF CHANGES IN SURVEILLANCE VIDEOS. Longin Jan Latecki, Xiangdong Wen, and Nilesh Ghubade DETECTION OF CHANGES IN SURVEILLANCE VIDEOS Longin Jan Latecki, Xiangdong Wen, and Nilesh Ghubade CIS Dept. Dept. of Mathematics CIS Dept. Temple University Temple University Temple University Philadelphia,

More information

ROBUST FACE DETECTION UNDER CHALLENGES OF ROTATION, POSE AND OCCLUSION

ROBUST FACE DETECTION UNDER CHALLENGES OF ROTATION, POSE AND OCCLUSION ROBUST FACE DETECTION UNDER CHALLENGES OF ROTATION, POSE AND OCCLUSION Phuong-Trinh Pham-Ngoc, Quang-Linh Huynh Department o Biomedical Engineering, Faculty o Applied Science, Hochiminh University o Technology,

More information

Semi-Supervised SVMs for Classification with Unknown Class Proportions and a Small Labeled Dataset

Semi-Supervised SVMs for Classification with Unknown Class Proportions and a Small Labeled Dataset Semi-Supervised SVMs or Classiication with Unknown Class Proportions and a Small Labeled Dataset S Sathiya Keerthi Yahoo! Labs Santa Clara, Caliornia, U.S.A selvarak@yahoo-inc.com Bigyan Bhar Indian Institute

More information

Motion based 3D Target Tracking with Interacting Multiple Linear Dynamic Models

Motion based 3D Target Tracking with Interacting Multiple Linear Dynamic Models Motion based 3D Target Tracking with Interacting Multiple Linear Dynamic Models Zhen Jia and Arjuna Balasuriya School o EEE, Nanyang Technological University, Singapore jiazhen@pmail.ntu.edu.sg, earjuna@ntu.edu.sg

More information

Dynamic Collision Detection

Dynamic Collision Detection Distance Computation Between Non-Convex Polyhedra June 17, 2002 Applications Dynamic Collision Detection Applications Dynamic Collision Detection Evaluating Safety Tolerances Applications Dynamic Collision

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Road Sign Analysis Using Multisensory Data

Road Sign Analysis Using Multisensory Data Road Sign Analysis Using Multisensory Data R.J. López-Sastre, S. Lauente-Arroyo, P. Gil-Jiménez, P. Siegmann, and S. Maldonado-Bascón University o Alcalá, Department o Signal Theory and Communications

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Non-exhaustive, Overlapping k-means

Non-exhaustive, Overlapping k-means Non-exhaustive, Overlapping k-means J. J. Whang, I. S. Dhilon, and D. F. Gleich Teresa Lebair University of Maryland, Baltimore County October 29th, 2015 Teresa Lebair UMBC 1/38 Outline Introduction NEO-K-Means

More information

2. Recommended Design Flow

2. Recommended Design Flow 2. Recommended Design Flow This chapter describes the Altera-recommended design low or successully implementing external memory interaces in Altera devices. Altera recommends that you create an example

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models

Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models Sven Koenig and Reid G. Simmons Carnegie Mellon University School o Computer Science Pittsburgh, PA

More information

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018 CPSC 340: Machine Learning and Data Mining Outlier Detection Fall 2018 Admin Assignment 2 is due Friday. Assignment 1 grades available? Midterm rooms are now booked. October 18 th at 6:30pm (BUCH A102

More information

A Novel Accurate Genetic Algorithm for Multivariable Systems

A Novel Accurate Genetic Algorithm for Multivariable Systems World Applied Sciences Journal 5 (): 137-14, 008 ISSN 1818-495 IDOSI Publications, 008 A Novel Accurate Genetic Algorithm or Multivariable Systems Abdorreza Alavi Gharahbagh and Vahid Abolghasemi Department

More information

Connected Components of Underlying Graphs of Halving Lines

Connected Components of Underlying Graphs of Halving Lines arxiv:1304.5658v1 [math.co] 20 Apr 2013 Connected Components of Underlying Graphs of Halving Lines Tanya Khovanova MIT November 5, 2018 Abstract Dai Yang MIT In this paper we discuss the connected components

More information