Traffic Volume (Time vs. Station)


Patricia Hamilton
6 years ago

1 Detecting Spatial Outliers: Algorithm and Application
Chang-Tien Lu
Spatial Database Lab, Department of Computer Science, University of Minnesota

2 Outline
Introduction
- Motivation
- General Definition of Spatial Outlier
- Related Work
Proposed Approach and Algorithm
Evaluation of Proposed Approach
Conclusions

3 Spatial Data Mining
Spatial databases are too large to analyze manually:
- NASA Earth Observation System (EOS)
- National Institute of Justice: crime mapping
- Census Bureau, Dept. of Commerce: census data
Spatial data mining:
- Discovers frequent and interesting spatial patterns for post-processing (knowledge discovery)
- Pattern examples: outliers, hot-spots, land-use classification

4 Spatial Outlier
Definition:
- A data point that is extreme relative to its neighbors
- Its attribute value is not necessarily extreme in the total population, but is extreme within its adjacent area
An example:
- Item: Palm Beach County; Neighbors(item) = counties in Florida

5 Application Domain
I-35W North Bound. Volume: the number of vehicles passing a station within 5 minutes.
[Figure: Traffic Volume (Time vs. Station); axes: I-35W Station ID (North Bound), Time]

6 Application Domain
Minneapolis-St. Paul (Twin Cities) traffic data set:
- 930 detectors (stations) installed on major highways
- Attributes measured periodically: volume, occupancy, speed
Interesting spatial outliers are discontinuities, assuming a smooth spatial attribute.

7 Application Domain: Twin Cities Traffic Data
Figure 1: Traffic volume vs. time for Station 138 on 1/
Figure 2: Traffic volume vs. time for Station 139 on 1/
Figure 3: Traffic volume vs. time for Station 140 on 1/

8 An Example of Spatial Outlier
Spatial outlier: S; global outlier: G.
[Figure: original data points with fitting curve; axes: Location, Attribute Values; labeled points include G, S, Q, D, L]
$Z_{s(x)}$ approach: $S(x) = f(x) - \frac{1}{k}\sum_{y \in N(x)} f(y)$
If $Z_{s(x)} = \left|\frac{S(x) - \mu_s}{\sigma_s}\right| > \theta$, declare x an outlier.
[Figure: outliers detected using the spatial test; axes: Location, Spatial Test $Z_{s(x)}$]
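The $Z_{s(x)}$ test can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's implementation: the ten-station chain, the threshold `theta=2.0`, the use of the population standard deviation, and the function names are all assumptions made for the example.

```python
from statistics import mean, pstdev

def spatial_z_scores(values, neighbors):
    """Return Z_s(x) = |(S(x) - mu_s) / sigma_s| for every location x."""
    # S(x): attribute value minus the average over the neighborhood N(x).
    s = {x: values[x] - mean([values[y] for y in neighbors[x]]) for x in values}
    s_vals = list(s.values())
    mu_s, sigma_s = mean(s_vals), pstdev(s_vals)  # population std (assumption)
    return {x: abs((s[x] - mu_s) / sigma_s) for x in s}

def spatial_outliers(values, neighbors, theta=2.0):
    """Locations whose spatial statistic exceeds the threshold theta."""
    z = spatial_z_scores(values, neighbors)
    return sorted(x for x in z if z[x] > theta)

# Hypothetical chain of ten stations on a smooth ramp; station 5 is a spike.
values = {i: float(10 + i) for i in range(10)}
values[5] = 40.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

print(spatial_outliers(values, neighbors))
```

On this toy data only station 5 exceeds the threshold; its two neighbors have a large S(x) as well, but stay below θ = 2.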

9 Evaluation of Statistical Assumption
The distribution of the traffic station attribute f(x) looks normal.
[Figure: Volume Distribution at 10:00am on 1/15/; axes: Volume Value, Number of Stations]
The distribution of $S(x) = f(x) - \frac{1}{k}\sum_{y \in N(x)} f(y)$ looks normal too.
[Figure: Normalized Volume Difference over Spatial Neighborhood at 10:00am on 1/15/; axes: Normalized Volume Difference over Spatial Neighborhood, Number of Stations]

10 General Definition of Spatial Outlier
Given:
- A spatial framework $S$ consisting of locations $s_1, s_2, \ldots, s_n$
- An attribute function $f : s_i \to R$ ($R$: set of real numbers)
- A neighborhood relationship $N \subseteq S \times S$
- A neighborhood aggregation function $f^N_{aggr} : R^N \to R$
- A difference function $F_{diff} : R \times R \to R$
- A test procedure $T_F : R \to \{True, False\}$
General definition (SO-outlier): an object $O \in S$ is an SO-outlier$(f, f^N_{aggr}, F_{diff}, T_F)$ if $T_F\{F_{diff}[f(x), f^N_{aggr}(f(x), N(x))]\} == TRUE$.
Constraints: $F_{diff}$ and $T_F$ are composed (e.g., ratio, product, linear combination) of a few statistical functions of $f$ and $f^N_{aggr}$.
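The definition is generic in three pluggable pieces: the aggregation function, the difference function, and the test procedure. A minimal Python sketch of that structure follows; the function name `so_outliers`, the five-node chain, and the threshold of 15 are illustrative assumptions, not part of the slides.

```python
from statistics import mean

def so_outliers(f, neighbors, aggr, diff, test):
    """Generic SO-outlier detector: T_F(F_diff(f(x), f_aggr over N(x)))."""
    result = []
    for x in f:
        agg = aggr([f[y] for y in neighbors[x]])   # neighborhood aggregate
        if test(diff(f[x], agg)):                  # test on the difference
            result.append(x)
    return result

# Hypothetical chain a-b-c-d-e where c is much larger than its neighbors.
f = {"a": 10.0, "b": 11.0, "c": 30.0, "d": 12.0, "e": 13.0}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}

found = so_outliers(
    f, neighbors,
    aggr=mean,                       # average over the neighborhood
    diff=lambda v, a: v - a,         # plain difference
    test=lambda d: abs(d) > 15,      # fixed threshold (assumption)
)
print(found)
```

Swapping in a different `diff` and `test` pair yields the other tests discussed in the deck without touching the loop.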

11 Related Work - Outlier Detection Tests
Outlier detection methods:
- One-dimensional (linear): frequency distribution over attribute value
- Multi-dimensional
  - Homogeneous dimensions
  - Bi-partite dimensions (spatial outlier detection)
    - Graphical: Variogram Cloud, Moran Scatterplot
    - Quantitative: Scatterplot, Spatial Statistic $Z_{s(x)}$

12 Related Work - Outlier Detection Tests
Related work, two families:
- One-dimensional: ignores geographic location
- Homogeneous multi-dimensional: mixes location with attributes
Spatial outlier:
- Two classes of dimensions: location, attributes
- Neighborhood: based on location dimensions
- Difference: compares attribute dimensions
Comparison of outlier detection methods:
                      One-dimensional (linear)   Homogeneous multi-dimensional   Bi-partite
Neighbor definition   N/A                        location and attribute          location
Comparison with       population distribution    location and attribute          attribute values of neighbors

13 Related Work
Different spatial outlier tests:
- Spatial Statistic Approach
- Scatter Plot Approach (Luc Anselin 94)
- Moran Scatter Plot Approach (Luc Anselin 95)
- Variogram Cloud Approach (graphic)
All of these are special cases of SO-outlier; this will be shown for one case.
Table 1: Test for Outlier Detection
Columns: difference function $F_{diff}(f, f^N_{aggr}, f^{G_2}_{aggr}, \ldots, f^{G_k}_{aggr})$; test function $T_F(f, f^N_{aggr}, f^{G_2}_{aggr}, \ldots, f^{G_k}_{aggr}, \theta)$
Spatial Statistic ($Z_{s(x)}$):
- Difference: $S(x) = f(x) - E(x)$, with $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- Test: $\left|\frac{S(x) - \mu_s}{\sigma_s}\right| > \theta$
Scatter Plot:
- Difference: $\epsilon = E(x) - (m \cdot f(x) + b)$, with $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- Test: $\left|\frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon}\right| > \theta$
Moran Scatter Plot:
- Difference: $Z_i = \frac{f(x) - \mu_f}{\sigma_f}$, $I_i = \frac{Z_i}{m_2}\sum_j W_{ij} Z_j$
- Test: ($Z_i$: +, $I_i$: -) or ($Z_i$: -, $I_i$: +)

14 Scatter Plot Approach
[Figure: Scatter Plot; axes: Attribute Values, Average Attribute Values Over Neighborhood; labeled points include Q, S]
Lemma: the scatter plot is a special case of SO-outlier.
Given:
- An attribute function $f(x)$
- A neighborhood relationship $N(x)$
- An aggregation function $f^N_{aggr}$: $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- A difference function $F_{diff}$: $\epsilon = E(x) - (m \cdot f(x) + b)$
Detect spatial outliers by the test function $T_F$: $\left|\frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon}\right| > \theta$

15 Moran Scatterplot Approach
X axis: $Z_i = \frac{f(x) - \mu_f}{\sigma_f}$, in standardized deviation units
Y axis (local Moran value): $I_i = \frac{Z_i}{m_2}\sum_j W_{ij} Z_j$, where $m_2 = \sum_j Z_j^2$ and $\sum_j W_{ij} = 1$
Four clusters:
- High-high: high attribute value, high attribute value neighbors (positive association)
- High-low: high attribute value, low attribute value neighbors (negative association)
- Low-high: low attribute value, high attribute value neighbors (negative association)
- Low-low: low attribute value, low attribute value neighbors (positive association)
Spatial outliers: High-low and Low-high
[Figure: Moran Scatter Plot; axes: Z Score of Attribute Values, Weighted Neighbor Z Score of Attribute Values; labeled points include Q, S]

16 Moran Scatter Plot Approach
Lemma: the Moran scatter plot is a special case of SO-outlier.
Given:
- An attribute function $f(x)$
- A neighborhood relationship $N(x)$: neighbor matrix $W_{ij}$
- Difference functions $Z_i = \frac{f(x) - \mu_f}{\sigma_f}$, $I_i = \frac{Z_i}{m_2}\sum_j W_{ij} Z_j$
Detect spatial outliers by the test function: ($Z_i$: +, $I_i$: -) or ($Z_i$: -, $I_i$: +)
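The quadrant classification behind the Moran scatter plot can be sketched as follows. This is an illustration under stated assumptions: the five-node chain, the row-standardized weights, and the helper name `moran_quadrants` are hypothetical, and the sketch classifies each location by the sign of its Z score and of the weighted neighbor Z score (the Y axis of the plot), flagging the High-low and Low-high quadrants.

```python
from statistics import mean, pstdev

def moran_quadrants(f, weights):
    """Label each location '<own>-<neighbors>' using Z scores of f."""
    vals = list(f.values())
    mu, sigma = mean(vals), pstdev(vals)
    z = {i: (f[i] - mu) / sigma for i in f}
    labels = {}
    for i in f:
        # Weighted neighbor Z score; each row of weights sums to 1.
        wz = sum(w * z[j] for j, w in weights[i].items())
        own = "high" if z[i] >= 0 else "low"
        nbr = "high" if wz >= 0 else "low"
        labels[i] = own + "-" + nbr
    return labels

# Hypothetical chain of five locations with a spike in the middle.
f = {0: 1.0, 1: 2.0, 2: 9.0, 3: 2.0, 4: 1.0}
weights = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {1: 0.5, 3: 0.5},
           3: {2: 0.5, 4: 0.5}, 4: {3: 1.0}}

labels = moran_quadrants(f, weights)
outliers = sorted(i for i, q in labels.items() if q in ("high-low", "low-high"))
print(labels, outliers)
```

In this toy example the spike lands in the high-low quadrant and its two low-valued neighbors land in low-high, so all three fall in the outlier quadrants.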

17 Variogram Cloud Approach
Approach:
- X axis: pairwise distance between neighbors
- Y axis: square root of the absolute difference of attribute values
Spatial outliers are locations that are near one another but have large attribute differences.
An example:
- S1: relation of the S and Q nodes
- S2: relation of the S and P nodes
[Figure: Variogram Cloud; axes: Pairwise Distance, Square Root of Absolute Difference of Attribute Values; labeled points (Q,S) and (P,S)]
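A variogram cloud is just one point per neighbor pair, so it is easy to compute directly. The sketch below is illustrative only: the coordinates, attribute values, the two cut-offs, and the function names are assumptions chosen so that the pairs (P,S) and (S,Q) stand out.

```python
from collections import Counter
from math import dist, sqrt

def variogram_cloud(coords, f, pairs):
    """One point per neighbor pair: (distance, sqrt(|f(a) - f(b)|))."""
    return {(a, b): (dist(coords[a], coords[b]), sqrt(abs(f[a] - f[b])))
            for a, b in pairs}

def suspicious_locations(cloud, max_dist, min_root_diff):
    """Count, per location, the close pairs with a large attribute difference."""
    hits = Counter()
    for (a, b), (d, r) in cloud.items():
        if d <= max_dist and r >= min_root_diff:
            hits[a] += 1
            hits[b] += 1
    return hits

# Hypothetical locations on a line; S differs sharply from both neighbors.
coords = {"P": (0, 0), "S": (1, 0), "Q": (2, 0), "D": (3, 0)}
f = {"P": 3.0, "S": 19.0, "Q": 3.0, "D": 4.0}
pairs = [("P", "S"), ("S", "Q"), ("Q", "D")]

cloud = variogram_cloud(coords, f, pairs)
hits = suspicious_locations(cloud, max_dist=1.5, min_root_diff=3.0)
print(hits.most_common(1))
```

A location that appears in several flagged pairs, as S does here, is the likely outlier; a location that appears in only one is more likely just its neighbor.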

18 Outline
Introduction
Proposed Approach and Algorithm
- Problem formulation
- Our approach
- Efficient algorithm
- Cost model
Evaluation of Proposed Approach
Conclusions

19 Problem Formulation
Given:
- A spatial framework $S$ consisting of locations $s_1, s_2, \ldots, s_n$
- An attribute function $f : s_i \to R$
- A neighborhood relationship $N \subseteq S \times S$
- An aggregation function $f^N_{aggr} : R^N \to R$
- A difference function $F_{diff} : R \times R \to R$
- A test procedure $T_F : R \to \{True, False\}$
General definition (SO-outlier): an object $O \in S$ is an SO-outlier$(f, f^N_{aggr}, F_{diff}, T_F)$ if $T_F\{F_{diff}[f(x), f^N_{aggr}(f(x), N(x))]\} == TRUE$.
Design: an efficient algorithm to detect SO-outliers, i.e., $O = \{s_i \mid s_i \in S, s_i \text{ is a spatial outlier}\}$.
Objective: efficiency, i.e., minimize the computation time.
Constraints:
- $F_{diff}$ and $T_F$ are composed (e.g., ratio, product, linear combination) of a few statistical functions of $f$ and $f^N_{aggr}$
- The size of the data set is much larger than the main memory size
- Computation time is determined by I/O time

20 Our Approach
Separate two phases:
- Model building
- Testing a node (or a set of nodes)
Computation structure of model building:
- Key insight: only a few statistics of $f(x)$ and $f^N_{aggr}$ are needed
- Spatial self-join using the $N(x)$ relationship
Computation structure of testing:
- Single node: spatial range query, i.e., a Get-All-Neighbors(x) operation
- A given set of nodes: a sequence of Get-All-Neighbors(x) operations

21 An Example of Our Approach
Consider the scatter plot.
Model building:
- Neighborhood aggregate function $f^N_{aggr}$: $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- Global aggregate functions $f^{G_1}_{aggr}, f^{G_2}_{aggr}, \ldots, f^{G_k}_{aggr}$: $\sum f(x)$, $\sum E(x)$, $\sum f(x)E(x)$, $\sum f^2(x)$, $\sum E^2(x)$
- Global parameters:
  $m = \frac{N \sum f(x)E(x) - \sum f(x) \sum E(x)}{N \sum f^2(x) - (\sum f(x))^2}$
  $b = \frac{\sum f^2(x) \sum E(x) - \sum f(x) \sum f(x)E(x)}{N \sum f^2(x) - (\sum f(x))^2}$
  $\sigma_\epsilon = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{n - 2}}$, where $S_{xx} = \sum f^2(x) - \frac{(\sum f(x))^2}{n}$ and $S_{yy} = \sum E^2(x) - \frac{(\sum E(x))^2}{n}$
Testing:
- Difference function: $\epsilon = E(x) - (m \cdot f(x) + b)$, where $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- Test function: $\left|\frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon}\right| > \theta$
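The point of the model-building phase is that the five running sums can be accumulated in a single pass over the locations, after which m, b, and σ_ε follow in closed form. A minimal Python sketch follows, assuming the standard least-squares formulas; the ten-station chain and the function names are hypothetical.

```python
from math import sqrt

def build_scatter_model(f, neighbors):
    """One pass over all locations; returns ((m, b, sigma_eps), E)."""
    n = len(f)
    sx = sy = sxy = sxx = syy = 0.0
    e = {}
    for x in f:
        e[x] = sum(f[y] for y in neighbors[x]) / len(neighbors[x])  # E(x)
        sx += f[x]                 # sum of f(x)
        sy += e[x]                 # sum of E(x)
        sxy += f[x] * e[x]         # sum of f(x) E(x)
        sxx += f[x] ** 2           # sum of f(x)^2
        syy += e[x] ** 2           # sum of E(x)^2
    denom = n * sxx - sx ** 2
    m = (n * sxy - sx * sy) / denom
    b = (sxx * sy - sx * sxy) / denom
    s_xx = sxx - sx ** 2 / n
    s_yy = syy - sy ** 2 / n
    sigma = sqrt((s_yy - m ** 2 * s_xx) / (n - 2))
    return (m, b, sigma), e

def standardized_residuals(f, e, model):
    """(epsilon - 0) / sigma_eps per location; residual mean is 0."""
    m, b, sigma = model
    return {x: (e[x] - (m * f[x] + b)) / sigma for x in f}

# Hypothetical chain of ten stations.
f = {i: float(10 + i) for i in range(10)}
f[5] = 40.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

model, e = build_scatter_model(f, neighbors)
z = standardized_residuals(f, e, model)
print(model)
```

The testing phase then only needs `model` plus the neighbors of the node under test, which is why the two phases can be separated.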

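22 Scatter Plot
X axis: attribute value of each point
Y axis: average attribute value in the adjacent area
Run a least-squares linear regression on the plot:
- Regression line: $y = mx + b$
- $m = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{N \sum x_i^2 - (\sum x_i)^2}$
- $b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{N \sum x_i^2 - (\sum x_i)^2}$
- Residual: $\epsilon = y - (mx + b)$
- Standard deviation: $\sigma_\epsilon = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{N - 2}}$
- Mean: $\mu_\epsilon = 0$
- $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$
- $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$
- If the standardized residual satisfies $\left|\frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon}\right| > \theta$, declare x an outlier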

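23 Spatial Outlier Detection: Summary
Model building:
- Compute simple statistics on $f(x)$ and $f^N_{aggr}$ (details in Table 2)
- Computable in a single spatial self-join on $N(x)$
Table 2: Model Building
Spatial Statistic:
- Attribute function $f$: $f(x)$
- Neighborhood aggregate function $f^N_{aggr}$: $S(x) = f(x) - E(x)$, $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- Global aggregate functions: $\sum S(x)$, $\sum S^2(x)$
- Global parameters: $\mu_s = \frac{\sum S(x)}{n}$, $\sigma_s = \sqrt{\frac{1}{n}\left[\sum S^2(x) - \frac{(\sum S(x))^2}{n}\right]}$
Scatter Plot:
- Attribute function $f$: $f(x)$
- Neighborhood aggregate function $f^N_{aggr}$: $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- Global aggregate functions: $\sum f(x)$, $\sum E(x)$, $\sum f(x)E(x)$, $\sum f^2(x)$, $\sum E^2(x)$
- Global parameters: $m = \frac{N \sum f(x)E(x) - \sum f(x) \sum E(x)}{N \sum f^2(x) - (\sum f(x))^2}$, $b = \frac{\sum f^2(x) \sum E(x) - \sum f(x) \sum f(x)E(x)}{N \sum f^2(x) - (\sum f(x))^2}$, $\sigma_\epsilon = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{n - 2}}$, where $S_{xx} = \sum f^2(x) - \frac{(\sum f(x))^2}{n}$ and $S_{yy} = \sum E^2(x) - \frac{(\sum E(x))^2}{n}$
Moran Scatter Plot:
- Attribute function $f$: $f(x)$
- Global aggregate functions: $\sum f(x)$, $\sum f^2(x)$
- Global parameters: $\mu_f = \frac{\sum f(x)}{n}$, $\sigma_f = \sqrt{\frac{1}{n}\left[\sum f^2(x) - \frac{(\sum f(x))^2}{n}\right]}$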

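24 Spatial Outlier Detection: Summary
Table 3: Test for Outlier Detection
Columns: difference function $F_{diff}(f, f^N_{aggr}, f^{G_2}_{aggr}, \ldots, f^{G_k}_{aggr})$; test function $T_F(f, f^N_{aggr}, f^{G_2}_{aggr}, \ldots, f^{G_k}_{aggr}, \theta)$
Spatial Statistic ($Z_{s(x)}$):
- Difference: $S(x) = f(x) - E(x)$, with $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- Test: $\left|\frac{S(x) - \mu_s}{\sigma_s}\right| > \theta$
Scatter Plot:
- Difference: $\epsilon = E(x) - (m \cdot f(x) + b)$, with $E(x) = \frac{1}{k}\sum_{y \in N(x)} f(y)$
- Test: $\left|\frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon}\right| > \theta$
Moran Scatter Plot:
- Difference: $Z_i = \frac{f(x) - \mu_f}{\sigma_f}$, $I_i = \frac{Z_i}{m_2}\sum_j W_{ij} Z_j$
- Test: ($Z_i$: +, $I_i$: -) or ($Z_i$: -, $I_i$: +)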

25 Model Building Algorithm
Algorithm A1, steps:
For each location x:
- Retrieve the data record of x = (f(x), identities of neighbors(x))
- Get-All-Neighbors(x): retrieve the data records of neighbors(x); if neighbor y is not in the memory buffer, request another I/O operation
- Compute the neighborhood aggregate function $f^N_{aggr}$
- Accumulate the global aggregate functions $f^{G_1}_{aggr}, f^{G_2}_{aggr}, \ldots, f^{G_k}_{aggr}$
Compute the global parameters G1_aggr, G2_aggr, ..., Gk_aggr.
I/O cost:
- Dominant operation: Get-All-Neighbors(x)
- The I/O cost of Get-All-Neighbors(x) is determined by the clustering efficiency, i.e., by how nodes are grouped into disk pages

26 Test Algorithm
Algorithm A2(a), steps:
For each location x along a route:
- Retrieve the data record of x = (f(x), identities of neighbors(x))
- Get-All-Neighbors(x): retrieve the data records of neighbors(x); if neighbor y is not in the memory buffer, request another I/O operation
- Compute the difference function $F_{diff}(f, f^N_{aggr}, f^{G_2}_{aggr}, \ldots, f^{G_k}_{aggr})$
- If $T_F(f, f^N_{aggr}, f^{G_2}_{aggr}, \ldots, f^{G_k}_{aggr}, \theta) == True$, declare x an outlier
Algorithm A2(b), steps:
For each location x within the random set:
- Same steps as A2(a); only the set of locations visited differs
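The test phase and its I/O behaviour can be simulated with a small page buffer. The sketch below is an assumption-laden illustration: the slides do not specify a replacement policy, so LRU is assumed here, and the eight-node chain, the two page layouts, and the names `PageBuffer` and `run_test_phase` are hypothetical.

```python
from collections import OrderedDict

class PageBuffer:
    """LRU buffer of disk pages that counts page reads (simulated I/O)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.reads = 0

    def fetch(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # hit: refresh recency
            return
        self.reads += 1                         # miss: one I/O operation
        self.pages[page_id] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)      # evict least recently used

def run_test_phase(nodes, f, neighbors, page_of, buffer, diff, test):
    """A2-style loop: fetch x, Get-All-Neighbors(x), apply the test."""
    outliers = []
    for x in nodes:
        buffer.fetch(page_of[x])                # record of x
        for y in neighbors[x]:                  # Get-All-Neighbors(x)
            buffer.fetch(page_of[y])
        agg = sum(f[y] for y in neighbors[x]) / len(neighbors[x])
        if test(diff(f[x], agg)):
            outliers.append(x)
    return outliers

# Hypothetical route of eight nodes, two nodes per page, two buffer pages.
f = dict(enumerate([10, 11, 12, 30, 13, 14, 15, 16]))
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
route = list(range(8))

def page_reads(page_of):
    buf = PageBuffer(capacity=2)
    found = run_test_phase(route, f, neighbors, page_of, buf,
                           diff=lambda v, a: v - a, test=lambda d: abs(d) > 10)
    return found, buf.reads

clustered = page_reads({i: i // 2 for i in range(8)})  # neighbors share pages
scattered = page_reads({i: i % 4 for i in range(8)})   # neighbors split apart
print(clustered, scattered)
```

Both layouts report the same outlier, but the layout that keeps neighbors on the same page needs far fewer page reads, which is the effect the cost model captures through clustering efficiency.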

27 I/O Cost Model
Definitions:
- CE: clustering efficiency
- N: total number of nodes
- Bfr: blocking factor (number of nodes in a page)
- K: number of neighbors for each node
- L: number of nodes in a route
- R: number of nodes in a random set
Assume two memory buffers.
Cost model of A1: (N/Bfr) + N*K*(1-CE)
- Cost to retrieve all nodes: (N/Bfr)
- Cost to retrieve the neighbors of all nodes: N*K*(1-CE)
Cost model of A2(a): L*(1-CE) + L*K*(1-CE)
Cost model of A2(b): R + R*K*(1-CE)
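The three cost formulas translate directly into code. The parameter values in the example (1000 nodes, blocking factor 8, 2 neighbors, CE = 0.75, and so on) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cost_model_building(n, bfr, k, ce):
    """A1: read every page once, plus neighbor fetches that miss."""
    return n / bfr + n * k * (1 - ce)

def cost_test_route(l, k, ce):
    """A2(a): consecutive route nodes are neighbors, so both terms scale with (1 - CE)."""
    return l * (1 - ce) + l * k * (1 - ce)

def cost_test_random(r, k, ce):
    """A2(b): each random node costs one read, plus neighbor fetches that miss."""
    return r + r * k * (1 - ce)

# Hypothetical parameters for illustration only.
print(cost_model_building(n=1000, bfr=8, k=2, ce=0.75))  # 125 + 500
print(cost_test_route(l=100, k=2, ce=0.75))              # 25 + 50
print(cost_test_random(r=150, k=2, ce=0.75))             # 150 + 75
```

In all three models the neighbor term shrinks linearly as CE approaches 1, so a clustering method with higher CE is predicted to cost less I/O.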

28 Clustering Efficiency Parameter
Computation cost (I/O cost) is determined by the clustering efficiency (CE).
CE definition:
- (Total number of unsplit edges) / (total number of edges)
- Probability[$v_i$ and a neighbor of $v_i$ are stored in the same disk page]
An example: $CE = \frac{9 - 3}{9} = \frac{6}{9} \approx 0.67$
[Figure: nodes V1 to V9 stored on Page A, Page B, and Page C]
CE depends on:
- Disk block size
- Data set size and data set distribution
- Clustering method
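Given an edge list and a node-to-page assignment, CE is a one-line ratio. The graph below is a hypothetical nine-edge example with three split edges, built to reproduce the 6/9 value; it is not the graph from the slide's figure, which is not recoverable from the text.

```python
def clustering_efficiency(edges, page_of):
    """Fraction of edges whose two endpoints are stored on the same page."""
    unsplit = sum(1 for a, b in edges if page_of[a] == page_of[b])
    return unsplit / len(edges)

# Hypothetical layout: three pages holding three nodes each.
page_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B",
           7: "C", 8: "C", 9: "C"}
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (7, 8),   # unsplit
         (3, 4), (6, 7), (2, 9)]                           # split across pages

ce = clustering_efficiency(edges, page_of)
print(ce)
```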

29 Outline
Introduction
Proposed Approach and Algorithm
Evaluation of Proposed Approach
- Candidates (clustering methods)
- Experiment design
- Results
Conclusions

30 Experimental Evaluation (Summary)
Hypothesis: the I/O cost of the algorithms is determined by the clustering efficiency.
Physical data page clustering methods:
- Graph-based method: CCAM
- Geometric method: Cell Tree
- Geometric method: Z-order
Metrics: clustering efficiency (CE), I/O cost
Benchmark data: Minneapolis-St. Paul traffic data
Benchmark tasks:
- Model building
- Test spatial outlier (route)
- Test spatial outlier (set of random nodes)
Variable parameters:
- Buffer size
- Disk page size

31 Clustering Method: CCAM
Connectivity-Clustered Access Method:
- Clusters the nodes via min-cut graph partitioning
- Uses a B+ tree with Z-order as the secondary index
[Figure: B+ tree index over keys, pointing to data pages that hold the clustered nodes]


33 Clustering Method: Cell Tree
Binary Space Partitioning (BSP):
- Decomposes the universe into disjoint convex subspaces
- Each leaf node corresponds to one of the subspaces
- Each tree node is stored on one disk page
- Purely geometric: cannot exploit edge information
[Figure: Cell-tree index with partitioning hyperplanes H1, H2, H3, pointing to data pages of nodes]


35 Clustering Method: Z-order
- Imposes a total order on the nodes
- Z-order = interleave(bits of X, bits of Y)
- Uses a B+ tree as the primary index
- Purely geometric: cannot exploit edge information
[Figure: B+ tree index over Z-order keys, pointing to data pages of nodes stored in key order]
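Bit interleaving is short enough to show in full. This sketch assumes non-negative integer coordinates and places the bits of X in the even positions and the bits of Y in the odd positions; the slide does not say which coordinate comes first, so that ordering and the 16-bit width are assumptions.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bit i of x -> even position
        key |= ((y >> i) & 1) << (2 * i + 1)    # bit i of y -> odd position
    return key

# Sorting grid cells by key gives the total order used to fill disk pages.
cells = [(x, y) for y in range(2) for x in range(2)]
ordered = sorted(cells, key=lambda c: z_order(*c))
print(ordered, z_order(3, 5))
```

Cells that are close in the plane tend to receive nearby keys, which is why the order clusters geometric neighbors but knows nothing about graph edges.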


37 Experiment Design
Experiment data set: Twin Cities traffic data
Each data object (node):
- Attribute values
- Neighbor list
- Size: 256 bytes
Figure 4: Experimental Layout
- Input: Twin Cities highway map
- Clustering method (CCAM, Z-order, Cell-tree) with a given page size produces sets of pages of data
- Model building (spatial self-join) produces the global parameters
- Test spatial outlier (A2) on (a) a set of random nodes and (b) nodes in a route
- Parameters: page size, buffer size, number of neighbors
- Measurements: I/O cost, clustering efficiency

38 Experimental Observation and Results
Step 1: Model building
- Compute the global aggregate functions
- Compute the global parameters
- Spatial self-join structure
Step 2: Test spatial outlier
- Detect outliers along a route
- Detect outliers in a set of random nodes

39 Model Building: Effect of Page Size
Fixed parameters: buffer size 64K
Variable parameters: page size, clustering strategy
[Figures: number of page accesses vs. page size (K bytes), and CE value vs. page size (K bytes), for Z-order, Cell tree, and CCAM]
- CCAM has the best performance
- CCAM has the highest CE value
- High CE => low I/O cost (cost model: (N/Bfr) + N*K*(1-CE))
- Increasing the page size reduces the number of page accesses

40 Model Building: Effect of Buffer Size
Fixed parameters:
- Page size: 2K
- Clustering efficiency: CCAM = 0.81, Cell = 0.69, Z-order = 0.51
Variable parameters: number of buffers, clustering strategy
[Figure: number of page accesses vs. number of buffers, for Z-order, Cell tree, and CCAM]
- Increasing the buffer size reduces the number of page accesses
- CCAM has the best performance

41 Test Spatial Outlier (Route): Effect of Page Size
Average I/O cost of an outlier query over 50 routes
Fixed parameters: buffer size 4 Kbytes, data point size 256 bytes
Variable parameters: page size, clustering strategy
[Figures: number of page accesses vs. page size (K bytes), and CE value vs. page size (K bytes), for Z-order, Cell tree, and CCAM]
- Increasing the page size reduces the number of page accesses
- CCAM has the best performance
- CCAM has the highest CE value
- High CE => low I/O cost (cost model: L*(1-CE) + L*K*(1-CE))
- Cell Tree has a CE value of zero when Bfr = 2
- Increasing the page size reduces the performance gap

42 Test Spatial Outlier (Random Nodes): Effect of Buffer Size
Average I/O cost of an outlier query over 50 sets of random nodes
Fixed parameters:
- Number of random nodes: 150; page size: 1 Kbyte
- CE values: CCAM = 0.68, Cell = 0.53, Z-order = 0.31
Variable parameters: buffer size, clustering strategy
[Figure: number of page accesses vs. number of buffers, for Z-order, Cell tree, and CCAM]
- Increasing the buffer size has no effect
- CCAM has the best performance

43 Test Spatial Outlier (Random Nodes): Effect of Page Size
Average I/O cost over 50 sets of random nodes
Fixed parameters: number of random nodes 150, buffer size 4K
Variable parameters: page size, clustering strategy
[Figure: number of page accesses vs. page size (K bytes), for Z-order, Cell tree, and CCAM]
- Increasing the page size reduces the number of page accesses
- CCAM has the best performance

44 Summary of Experimental Results
Clustering efficiency:
- CCAM achieves higher clustering efficiency than Cell tree and Z-order
Test parameter and test result computation:
- CCAM has lower I/O cost than Cell tree and Z-order
- Higher CE leads to lower I/O cost
- CE is a good predictor of relative I/O performance
- A larger page size improves the clustering efficiency of all methods and reduces the performance gap between them

45 Conclusion
SO-outlier:
- A general definition of spatial outlier
- Existing definitions are shown to be special cases of SO-outlier: Scatter Plot, Moran Scatterplot, Spatial Statistic
Developed efficient algorithms to detect spatial outliers:
- Model building
- Test spatial outlier
Recognized the computation structure of spatial outlier detection algorithms:
- Self-join
- Get-All-Neighbors() dominates the I/O cost
Developed algebraic cost models
Evaluated alternative page clustering methods

46 Future Direction
Extend the spatial outlier detection test:
- Multiple (non-location) attributes
- Location attribute that includes time: temporal and spatial-temporal outliers
Extend the experiments:
- NASA data sets (uniform grid)
- Hypothesis: geometric clustering may perform well
Explore other spatial patterns beyond spatial outliers:
- Land-use classification
