Detecting Graph-Based Spatial Outliers: Algorithms and Applications(A Summary of Results) Λ y

Size: px

Start display at page:

Download "Detecting Graph-Based Spatial Outliers: Algorithms and Applications(A Summary of Results) Λ y"

Annabella Skinner
5 years ago
Views:

1 Detectin Graph-Based Spatial Outliers: Alorithms and Applications(A Summary of Results) Λ y Shashi Shekhar Department of Computer Science and Enineerin University of Minnesota 2 Union ST SE, Minneapolis, MN shekhar@cs.umn.edu Chan-Tien Lu Department of Computer Science and Enineerin University of Minnesota 2 Union ST SE, Minneapolis, MN ctlu@cs.umn.edu Pushen Zhan Department of Computer Science and Enineerin University of Minnesota 2 Union ST SE, Minneapolis, MN pushen@cs.umn.edu ABSTRACT Identification of outliers can lead to the discovery of unexpected, interestin, and useful knowlede. Existin methods are desined for detectin spatial outliers in multidimensional eometric data sets, where a distance metric is available. In this paper, we focus on detectin spatial outliers in raph structured data sets. We define statistical tests, analyze the statistical foundation underlyin our approach, desin several fast alorithms to detect spatial outliers, and provide a cost model for outlier detection procedures. In addition, we provide experimental results from the application of our alorithms on a Minneapolis-St.Paul(Twin Cities) traffic dataset to show their effectiveness and usefulness. Keywords Outlier Detection, Spatial Data Minin, Spatial Graphs 1. INTRODUCTION Outliers have been informally defined as observations which appear to be inconsistent with the remainder of that set of data [2], or which deviate so much from other observations so as to arouse suspicions that they were enerated by a different mechanism [5]. The identification of outliers can lead to the discovery of unexpected knowlede and has a number of practical applications in areas such as credit card fraud, the performance analysis of athletes, votin irreularities, Λ This work is supported in part by the Army Hih Performance Computin Research Center under the auspices of the Department of the Army, Army Research Laboratory Cooperative areement number DAAH /contract number DAAH4-95-C-8, and by the National Science Foundation under rant y A full version paper is available at Permission to make diital or hard copies of all or part of this work for personal or classroom use is ranted without fee provided that copies are not made or distributed for profit or commercial advantae and that copies bear this notice and the full citation on the first pae. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD-2 San Francisco, CA, USA Copyriht 2 ACM X-XXXXX-XX-X/XX/XX...$5.. bankruptcy, and weather prediction. Outliers in a spatial data set can be classified into three cateories: set-based outliers, multi-dimensional space-based outliers, and raph-based outliers. A set-based outlier is a data object whose attributes are inconsistent with attribute values of other objects in a iven data set reardless of spatial relationships. Both multi-dimensional space-based outliers and raph-based outliers are spatial outliers, that is, data objects that are sinificantly different in attribute values from the collection of data objects amon spatial neihborhoods. However, multi-dimensional space-based outliers and raph-based outliers are based on different spatial neihborhood definitions. In multi-dimensional space-based outlier detection, the definition of spatial neihborhood is based on Euclidean distance, while in raph-based spatial outlier detections, the definition is based on raph connectivity. Many spatial outlier detection alorithms have been recently proposed; however, spatial outlier detection remains challenin for various reasons. First, the choice of a neihborhood is nontrivial. Second,the desin of statistical tests for spatial outliers needs to account for the distribution of the attribute values at various locations as well as the areate distribution of attribute values over the neihborhoods. In addition, the computation cost of determinin parameters for a neihborhood-based test can be hih due to the possibility of join computations. In this paper, we formulate a eneral framework for detectin outliers in spatial raph data sets, and propose an efficient raph-based outlier detection alorithm. We provide cost models for outlier detection queries, and compare underlyin data storae and clusterin methods that facilitate outlier query processin. We also use our basic alorithm to detect spatial outliers in a Minneapolis-St.Paul(Twin Cities) traffic data set, and show the correctness and effectiveness of our approach. 1.1 An Illustrative Application Domain In 1995, the University of Minnesota and the Traffic Manaement Center(TMC) Freeway Operations roup started the development of a database to archive sensor network measurements from the freeway system in the Twin Cities. The sensor network includes about nine hundred stations, each of which contains one to four loop detectors, dependin on the number of lanes. Sensors embedded in the freeways monitor the occupancy and volume of traffic on the road. At reular intervals, this information is sent to the Traffic Man-

2 aement Center for operational purposes, e.., ramp meter control, and research on traffic modelin and experiments. In this application, each station exhibits both raph and attribute properties. The topoloical space is the map, where each station represents a node and the connection between each station and its surroundin stations can be represented as an ede. The attribute space for each station is the traffic flow information (e.., volume, occupancy) stored in the value table. We are interested in discoverin the location of stations whose measurements are inconsistent with those of their raph-based spatial neihbors and the time periods when those abnormalities arise. This outlier detection task is to: ffl Build a statistical model for a spatial data set ffl Check whether a specific station is an outlier ffl Check whether stations on a route are outliers We use three neihborhood definitions in this application as shown in Fiure 1. First, we define a neihborhood based on spatial raph connectivity as a spatial raph neihborhood. In Fiure 1, (s 1;t 2) and (s 3;t 2) are the spatial neihbors of (s 2;t 2)ifs 1 and s 3 are connected to s 2 in a spatial raph. Second, we define a neihborhood based on time series as a temporal neihborhood. In Fiure 1, (s 2;t 1) and (s 2;t 3) are the temporal neihbors of (s 2;t 2)if t 1, t 2, and t 3 are consecutive time slots. In addition, we define a neihborhood based on both space and time series as a spatial-temporal neihborhood. In Fiure 1, (s 1;t 1), (s 1;t 2), (s 1;t 3), (s 2;t 1), (s 2;t 3), (s 3;t 1), (s 3;t 2), and (s 3;t 3) are the spatial-temporal neihbors of (s 2;t 2)ifs 1 and s 3 are connected to s 2 in a spatial raph, and t 1, t 2, and t 3 are consecutive time slots. Time t 3 t t 1 Spatial 2 Neihborhood Spatial-Temporal Neihborhood Temporal Neihborhood s s s Space Leend Spatial Neihbors Temporal Neihbors Additional Neihbors in Spatial-Temporal Neihborhood Window of Spatial-Temporal Neihborhood Fiure 1: Spatial and Temporal outlier in traffic data 1.2 Problem Formulation In this section, we formally define the spatial outlier detection problem. Given a spatial framework S for the underlyin spatial raph G, an attribute f over S, and neihborhood relationship R, we can build a model and construct statistical tests for spatial outliers based on a spatial raph accordin to the iven confidence level threshold. The problem is formally defined as follows. Spatial Outlier Detection Problem Given: ffl A spatial raph G = fs; E, where S is a spatial framework consistin of locations s 1;s 2;::: ;s n and E is a collection of edes between locations in S ffl A neihborhood relationship R consistent with E ffl An attribute function f: S! a set of real numbers ffl An areate function f ar: R N! a set of real numbers to summarize values of attribute f over a neihborhood relationship R N R ffl Confidence level threshold Find: ffl A set of spatial outliers Objective: ffl Correctness: outliers identified are sinificantly different from those of their neihborhood ffl Efficiency: to minimize the computation time Constraints: ffl Attribute values have a normal distribution ffl Size of the data set fl main memory size ffl The rane of attribute function f is the set of real numbers The formulation shows two subtasks in this spatial outlier detection problem: (a) the desin of a statistical model M and a test for spatial outliers (b) the desin of an efficient computation method to estimate parameters of the test, test whether a specific spatial location is an outlier, and test whether spatial locations on a iven path are outliers. 2. RELATED WORK, CONTRIBUTION Many outlier detection alorithms [1, 2, 3, 7, 8, 1, 12, 14] have been recently proposed. These methods can be broadly classified into two cateories, namely set-based outlier detection methods and spatial-set-based outlier detection methods. The set-based outlier detection alorithms [2, 6] consider the statistical distribution of attribute values, inorin the spatial relationships amon items. Numerous outlier detection tests, known as discordancy tests [2, 6], have been developed for different circumstances, dependin on the data distribution, the number of expected outliers, and the types of expected outliers. The main idea is to fit the data set to a known standard distribution, and develop a test based on distribution properties. Spatial-set-based outlier detection methods consider both attribute values and spatial relationships. They can be further rouped into two cateories, namely multi-dimensional metric space-based methods and raph-based methods. The multi-dimensional metric space-based methods model data sets as a collection of points in a multidimensional space, and provide tests based on concepts such as distance, density, and convex-hull depth. We discuss different example tests now. Knorr and N presented the notion of distancebased outliers [7, 8]. For a k dimensional data set T with N objects, an object O in T is a DB(p; D)-outlier if at least a fraction p of the objects in T lies reater than distance D from O. Ramaswamy et al. [11] proposed a formulation for distance-based outliers based on the distance of a point from its k th nearest neihbor. After rankin points by the distance to its k th nearest neihbor, the top n points are declared as outliers. Breuni et al. [3] introduced the notion of a local" outlier where the outlier-deree of an object is determined by takin into account the clusterin structure in a bounded neihborhood of the object, e.., k nearest neihbors. In computational eometry, some depth-based approaches [12, 1] oranize data objects in convex hull layers in data space accordin to their peelin depth [1], and

3 outliers are expected to be found from data objects with a shallow depth value. Yu et al. [14] introduced an outlier detection approach, called FindOut, which identifies outliers by removin clusters from the oriinal data. Methods for detectin outliers in multi-dimensional Euclidean space have several limitations. First, multi-dimensional approaches assume that the data items are embedded in a isometric metric space and do not capture the spatial raph structure. Secondly, they do not exploit apriori information about the statistical distribution of attribute data. Last, they seldom provide a confidence measure of the discovered outliers. In this paper, we formulate a eneral framework for detectin spatial outliers in a spatial data set with an underlyin raph structure. We define neihborhood-based statistics and validate the statistical distribution. We then desin a statistically correct test for discoverin spatial outliers, and develop a fast alorithm to estimate model parameters, as well as to determine the results of a spatial outlier test on a iven item. In addition, we evaluate our method in a Twin Cities traffic data set and show the effectiveness and usefulness of our approach. 3. OUR APPROACH In this section, we list the key desin decisions and propose an I/O efficient alorithm for spatial raph-based outliers. 3.1 Choice of Spatial Statistic For spatial statistics, several parameters should be predetermined before runnin the spatial outlier test. First, the neihborhood can be selected based on a fixed cardinality or a fixed raph distance or a fixed Euclidean distance. Second, the choice of neihborhood areate function can be mean, variance, or auto-correlation. Third, the choice for comparin a location with its neihbors can be either just a numberor avector of attribute values. Finally, the statistic for the base distribution can be selected from various choices. The statistic we used is S(x) =[f(x) E y2n(x) (f(y))], where f(x) is the attribute value for a data record x, N(x)is the fixed cardinality set of neihbors of x, and E y2n(x) (f(y)) is the averae attribute value for neihbors of x. Statistic S(x) denotes the difference of the attribute value of each data object x and the averae attribute value of x s neihbors. Lemma 1. Spatial Statistic S(x) =[f(x) E y2n(x) (f(y))] is normally distributed if attribute value f(x) is normally distributed. Proof: Given the definition of neihborhood, for each data record x, the averae attribute values E y2n(x) (f(y)) of x skneihbors can be calculated. Since attribute values f(x) are normally distributed and an averae of normal variables is also normally distributed, the averae attribute values E y2n(x) (f(y)) over neihbors is also a normal distribution for a fixed cardinality neihborhood. Since the attribute value and the averae attribute value over neihbors are two normal variable, the distribution of the difference of S(x) of each data object x and the averae attribute value of x s neihbors is also normally distributed. 3.2 Test for Outlier Detection The test for detectin an outlier can be described as j S(x) μs ff s j >. For each data object x with an attribute value f(x), the S(x) is the difference of the attribute value of data object x and the averae attribute value of its neihbors; μ s is the mean value of all S(x), and ff s is the standard deviation of all S(x). The choice of depends on the specified confidence interval. For example, a confidence interval of 95 percent will lead to ß Computation of Test Parameters We now propose an I/O efficient alorithm to calculate the test parameters, e.., mean and standard deviation for the statistics, as shown in Alorithm 1. The computed mean and standard deviation can then be used to detect the outlier in the incomin data set. Given an attribute data set V and the connectivity raph G, the Test Parameters Computation(TPC) alorithm first retrieves the neihbor nodes from G for each data object x. It then computes the difference of the attribute value of x and the averae of the attribute values of x s neihbor nodes. These different values are then stored as a set in the AvDist Set. Finally, the AvDist Set is used to et the distribution value μ s and ff s. Note that the data objects are processed on a pae basis to reduce redundant I/O. Test Parameters Computation(TPC) Alorithm Input: S is the attribute space; D is the attribute data set in S; F is the distance function in S; ND is the depth of neihbor; G =(D; E) is the spatial raph; Output: (μ s,ff s). for(i=1;i»jdj ;i++)f O i=get One Object(i,D); /* Select each object from D */ NNS=Find Neihbor Nodes Set(O i,nd,g); /* Find neihbor nodes of O i from G */ Accum Dist=; for(j=1;j» jnssj;j++)f O k =Get One Object(j,NNS); /* Select each object */ Accum Dist += F(O i;o k ;S) Av Dist = Accum Dist / jnnsj; Add Element(AvDist Set,i); /* Add the element */ μ s = Get Mean(AvDist Set); /* Compute Mean */ ff s = Get Standard Dev(AvDist Set); return (μ s,ff s). Alorithm 1: Pseudo-code for test parameters computation 3.4 Computation of Test Results The neihborhood areate statistics value, e.., mean and standard deviation, computed in the TPC alorithm can be used to verify the outliers in an incomin data set. The two verification procedures are Route Outlier Detection(ROD) and Random Node Verification(RNV). The ROD procedure detects the spatial outliers from a user specified route, as shown in Alorithm 2. The RNV procedure checks the outliers from a set of randomly enerated nodes. Given route RN in the data set D with raph structure G, the ROD alorithm first retrieves the neihborin nodes from G for each data object x in the route RN, then it computes the difference S(x) between the attribute value of x and the averae of attribute values of x s neihborin nodes. Each S(x) can then be tested usin the spatial outlier de-

4 tection test j S(x) μs ff s j >. The is predetermined by the iven confidence interval. The steps to detect outliers in both ROD and RNV are similar, except that the RNV has no shared data access needs across tests for different nodes. The I/O operations for Find Neihbor Nodes Set() in different iterations are independent of each other in RNV. We note that the operation Find Neihbor Nodes Set() is executed once in each iteration and dominates the I/O cost of the entire alorithm. The storae of the data set should support the I/O efficient computation of this operation. We discuss the choice for storae structure and provide an experimental comparison in Sections 5 and 6. Route Outlier Detection(ROD) Alorithm Input: S is the attribute space; D is the attribute data set in S; F is the distance function in S; ND is the depth of neihbor; G =(D; E) is the spatial raph; CI is the confidence interval; (μ s,ff s)are mean and standard deviation calculated in TPC; RN is the set of node in a route; Output: Outlier Set. for(i=1;i»jrnj ;i++)f O i=get One Object(i,D); /* Select each object from D */ NNS=Find Neihbor Nodes Set(O i,nd,g); /* Find neihbor nodes of O i from G */ Accum Dist=; for(j=1;j» jnssj;j++)f O k =Get One Object(j,NNS); /* Select each object */ Accum Dist += F(O i;o k ;S) AvDist = Accum Dist/jNNSj; T value = AvDist μs ffs /*Check the normal distribution table */ if( Check Normal Table(T value,ci)== True)f Add Element(Outlier Set,i); /* Add the element */ return Outlier Set. Alorithm 2: Pseudo-code for route outlier detection The I/O cost of ROD and RNV are also dominated by the I/O cost of Find Neihbor Nodes Set() operation. 4. ANALYTICAL EVALUATION In this section, we provide simple alebraic cost models for the I/O cost of outlier detection operations, usin the Connectivity Residue Ratio(CRR) measure of physical pae clusterin methods. The CRR value is determined by the pae clusterin method, the data record size, and the pae size. The CRR value is defined as follows. CRR = Symbol ff fi N L R Λ T otal number of unsplit edes Total numbe of edes Meanin The CRR value Averae blockin factor Total number ofnodes Number of nodes in a route Number of nodes in a random set Averae number of neihbors for each node Table 1: Symbols used in Cost Analysis Table 1 lists the symbols used to develop our cost formulas. ff is the CRR value. fi denotes the blockin factor, which is the number of data records that can be stored in one memory pae. Λ is the averae number of nodes in the neihbor list of a node. N is the total number of nodes in the data set, L is the number of nodes alon a route, and R is the number of nodes randomly enerated by users for spatial outlier verification. 4.1 Cost Modelin The TPC alorithm is a nest loop index join. Suppose that we use two memory buffers. If one memory buffer stores the data object x used in the outer loop and the other memory buffer is reserved for processin the neihbors of x, we et the followin cost function to estimate the number number of pae accesses. C TPC = N fi + N Λ Λ Λ (1 ff) The outer loop retrieves all the data records on a pae basis, and has an areated cost of N fi. For each nodex, on averae, ff Λ Λ neihbors are in the same pae as x, and can be processed without redundant I/O. Additional data pae accesses are needed to retrieve the other (1 ff)λλ neihbors, and it takes at most (1 ff) Λ Λ data pae accesses. Thus the expected total cost for the inner loop is N Λ Λ Λ (1 ff). In a similar way, a cost model for ROD alorithms can be derived as follows: C ROD = L Λ (1 ff)+lλλλ (1 ff) = L Λ (1 ff) Λ (1 + Λ); and a cost model for RNV alorithms can be derived as follows: C RNV = R + R Λ Λ Λ (1 ff) 5. EXPERIMENT DESIGN In this section, we describe the layout of our experiments and then illustrate the candidate clusterin methods. 5.1 Experimental Layout The desin of our experiments is shown in Fiure 2. Usin the Twin Cities Hihway Connectivity Graph(TCHCG), we took data from the TCHCG and physically stored the data set into data paes usin different clusterin strateies and pae sizes. These data paes were then processed to enerate the lobal distribution or samplin distribution, dependin on the size of the data sets. We compared different data pae clusterin schemes: [13], Z-orderin [9], and Cell-tree [4]. Other parameters of interest were the size of the memory buffer, the bufferin strateies, the memory block size(pae size), and the number of neihbors. The measures of our experiments were the CRR values and I/O cost for each outlier detection procedure. Twin-Cities Hihway Connectivity Graph Buffer size No of neihbors Bufferin stratey Z-order Cell-tree Pae Size Clusterin method Sets of paes of data Sets of paes of data Test Parameters Computation (TPC) (Nest loop index join) Test Parameters Buffer No of Bufferin Size Neihbors Stratey Route Outlier Detection(ROD). Random Node Verification (RNV). CRR I/O Cost Fiure 2: Experimental Layout

5 The experiments were conducted on many raphs. We present the results on a representative raph, which is a spatial network with 99 nodes that represents the traffic detector stations for a 2-square-mile section of the Twin Cities area. This data set is provided by the Minnesota Dept. of Transportation(MnDot). We used a common record type for all the clusterin methods. Each record contains a node and its neihbor-list, i.e., a successor-list and a predecessor-list. We also conducted performance comparisons of the I/O cost for outlier-detection query processin. 5.2 Candidate Clusterin Methods In this section we describe the candidate clusterin methods used in the experiments. Connectivity-Clustered Access Method(): [13] clusters the nodes of the raph via raph partitionin, e.., Metis. Other raph-partitionin methods can also be used as the basis of our scheme. In addition, an auxiliary secondary index is used to support query operations. The choice of a secondary index can be tailored to the application. We used the B + tree with Z-order in our experiments, since the benchmark raph was embedded in raphical space. Other access methods such as the R-tree and Grid File can alternatively be created on top of the data file, as secondary indices in to suit the application. Linear Clusterin by Z-order: Z-order [9] utilizes spatial information while imposin a total order on the points. The Z-order of a coordinate (x,y) is computed by interweavin the bits in the binary representation of the two values. Alternatively, Hilbert orderin may be used. A conventional one-dimensional primary index (e.. B + -tree) can be used to facilitate the search. Cell Tree: A cell tree [4] is a heiht-balanced tree. Each cell tree node corresponds, not necessarily to a rectanular box, but to a convex polyhedron. A cell tree restricts polyhedra to partitions of a BSP(Binary Space Partitionin), to avoid overlaps amon siblin polyhedra. Each cell tree node corresponds to one disk space, and the leaf nodes contain all the information required to answer a iven search query. The cell tree can be viewed as a combination of a BSPand R + -tree, or as a BSP-tree mapped on paed secondary memory. 6. EXPERIMENTAL RESULTS In this section, we illustrate the outlier examples detected in the traffic data set, present the results of our experiments, and test the effectiveness of the different pae clusterin methods. To simplify the comparison, the I/O cost represents the number of data paes accessed. This represents the relative performance of the various methods for very lare databases. For smaller databases, the I/O cost associated with the indices should be measured. Here we present the evaluation of I/O cost for the TPC alorithm. The evaluations of I/O cost for the RNV and ROD alorithms are available in the full version paper. 6.1 Outliers Detected We tested the effectiveness of our alorithm on the Twin Cities traffic data set and detected numerous outliers, as described in the followin examples. Fiure 3 shows one example of traffic flow outliers. Fiures 3(a) and (b) are the traffic volume maps for I-35W North Bound and South Bound, respectively, on 1/21/1997. The X-axis is a 5-minute time slot for the whole day and the Y-axis is the label of the stations installed on the hihway, startin from 1onthe north end to 61 on the south end. The abnormal white line at 2:45pm and the white rectanle from 8:2am to 1:am on the X-axis and between stations 29 to 34 on the Y-axis can be easily observed from both (a) and (b). The white line at 2:45pm is an instance of temporal outliers, where the white rectanle is a spatial-temporal outlier. Moreover, station 9 in Fiure 3(a) exhibits inconsistent traffic flow compared with its neihborin stations, and was detected as a spatial outlier. I35W Station ID(North Bound) Averae Traffic Volume(Time v.s. Station) Time (a) I-35W North Bound I35W Station ID(South Bound) Averae Traffic Volume(Time v.s. Station) Time (b) I-35W South Bound Fiure 3: An example of outliers 6.2 Evaluation of I/O cost for TPC alorithm In this section, we present the results of our evaluation of the I/O cost and CRR value for alternative clusterin methods while computin the test parameters. The parameters of interest are buffer size, pae size, number of neihbors, and neihborhood depth The effect of pae size and CRR value Fiures 4 (a) and (b) show the number of data paes accessed and the CRR values respectively, for different pae clusterin methods as the pae sizes chane. The buffer size is fixed at 32 Kbytes. As can be seen, a hiher CRR value implies a lower number of data pae accesses, as predicted in the cost model. outperforms the other competitors for all four pae sizes, and has better performance than Z-order clusterin The effect of neihborhood cardinality We evaluated the effect of varyin the number of neihbors and the depth of neihbors for different pae clusterin methods. We fixed the pae size at 1K, and the buffer size at 4K, and used the LRU bufferin stratey. Fiure 5 shows the number of pae accesses as the number of neihbors for each node increases from 2 to 1. has better performance than Z-order and. The performance rankin for each pae clusterin method remains the same for different numbers of neihbors. Fiure 5 shows the number of pae accesses as the neihborhood depth increases from 1 to 5. has better performance than Z-order and for all the neihborhood depths

6 Number of pae accesses Pae Size (K bytes) (a) Pae Accesses CRR Value Pae Size (K Bytes) (b) CRR Fiure 4: Effect of pae size on data pae accesses and CRR (buffer size = 32K) Number of pae accesses No of Neihbors (a) Number of Neihbors Number of pae accesses (b) Depth Neihborhood Depth Neihborhood Fiure 5: Effect of neihborhood cardinality on data pae accesses (Pae size = 1K, Buffer size = 4K) 7. CONCLUSIONS In this paper, we focused on detectin outliers in spatial raph data sets. We proposed the notion of a neihbor outlier in raph structured data sets, desined a fast alorithm to detect outliers, analyzed the statistical foundation underlyin our approach, provided the cost models for different outlier detection procedures, and compared the performance of our approach usin different data clusterin approaches. In addition, we provided experimental results from the application of our alorithm on Twin Cities traffic archival to show its effectiveness and usefulness. We have evaluated alternative clusterin methods for neihbor outlier query processin, includin model construction, random node verification, and route outlier detection. Our experimental results show that the, which achieves the hihest CRR, provides the best overall performance. 9. REFERENCES [1] M. Ankerst, M. Breuni, H. Krieel, and J. Sander. Optics: Orderin points to identify the clusterin structure. In Proceedins of the 1999 ACM SIGMOD International Conference on Manaement of Data, Philadephia, Pennsylvania, USA, paes 49 6, [2] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley, New York, 3rd edition, [3] M. Breuni, H. Krieel, R. T. N, and J. Sander. Optics-of: Identifyin local outliers. In Proc. of PKDD '99, Praue, Czech Republic, Lecture Notes in Computer Science (LNAI 174), pp , Spriner Verla, [4] O. Gunther. The Desin of the Cell Tree: An Object-Oriented Index Structure for Geometric Databases. In Proc. 5th Intl. Conference on Data Enineerin, Feb [5] D. Hawkins. Identification of Outliers. Chapman and Hall, 198. [6] R. Johnson. Applied Multivariate Statistical Analysis. Prentice Hall, [7] E. Knorr and R. N. A unified notion of outliers: Properties and computation. In Proc. of the International Conference on Knowlede Discovery and Data Minin, paes , [8] E. Knorr and R. N. Alorithms for minin distance-based outliers in lare datasets. In Proc. 24th VLDB Conference, [9] A. Orenstein and T. Merrett. A Class of Data Structures for Associative Searchin. In Proc. Symp. on Principles of Database Systens, paes , [1] F. Preparata and M. Shamos. Computatinal Geometry: An Introduction. Spriner Verla, [11] S. Ramaswamy, R. Rastoni, and K. Shim. Efficient alorithms for minin outliers from lare data sets. In Proceedins of ACM SIGMOD International Conference on Manaement of Data, Dallas, Texas, 2. [12] I. Ruts and P. Rousseeuw. Computin depth contours of bivariate point clouds. In Computational Statistics and Data Analysis, 23: , [13] S. Shekhar and D.-R. Liu. : A connectivity-clustered Access Method for Areate Queries on Transportation Networks-A Summary of Results. IEEE Trans. on Knowlede and Data En., 9(1), January [14] D. Yu, G. Sheikholeslami, and A. Zhan. Findout: Findin outliers in very lare datasets. In Department of Computer Science and Enineerin State University of New York at Buffalo Buffalo, Technical report 99-3, ACKNOWLEDGMENT We are particularly rateful to Professor Vipin Kumar, and our Spatial Database Group members, Weili Wu, Yan Huan, Xiaobin Ma,and Hui Xion for their helpful comments and valuable discussions. We would also like to express our thanks to Kim Koffolt for improvin the readability and technical accuracy of this paper.

Traffic Volume(Time v.s. Station)

Traffic Volume(Time v.s. Station) Detecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu hone: 612-378-7705 Outline Introduction Motivation