Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1
Spatial Outlier A spatial data point that is extreme relative to its neighbors 2
Outline Single-Attribute Spatial Outlier Detection Z-value approach Iterative Approach & Median Multi-Attribute Spatial Outlier Detection Region Outlier Detection & Tracking Conclusion 3
An Example of Spatial Outlier Spatial outlier: S, global outlier: G, L 4
Spatial Outlier Detection: Z_{S(x)} approach. Function: $S(x) = f(x) - \frac{1}{k}\sum_{y \in N(x)} f(y)$. If $Z_{S(x)} = \left|\frac{S(x) - \mu_s}{\sigma_s}\right| > \theta$, declare x as a spatial outlier. 5
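A minimal Python sketch of the Z-value test on this slide. The dictionary-based inputs, the function name `z_value_outliers`, the default threshold of 2.0, and the toy data are assumptions for illustration, not part of the original talk.

```python
from statistics import mean, pstdev

def z_value_outliers(values, neighbors, theta=2.0):
    """Flag points whose neighborhood difference S(x) is extreme.

    values    -- dict: point id -> attribute value f(x)
    neighbors -- dict: point id -> list of neighbor ids N(x)
    theta     -- Z-value threshold (assumed default)
    """
    # S(x) = f(x) - average of f over the neighbors of x
    s = {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}
    diffs = list(s.values())
    mu, sigma = mean(diffs), pstdev(diffs)
    if sigma == 0:
        return []  # no variation at all, so nothing can be extreme
    return [x for x in values if abs((s[x] - mu) / sigma) > theta]

# Toy data (hypothetical): ten stations on a line, left/right neighbors.
values = {i: 10.0 for i in range(10)}
values[5] = 100.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
outliers = z_value_outliers(values, neighbors)
```

The population standard deviation (`pstdev`) is used here; a sample standard deviation would also be a reasonable choice.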
Evaluation of Statistical Assumption: the distribution of the traffic station attribute f(x) is normal. The distribution of $S(x) = f(x) - \frac{1}{k}\sum_{y \in N(x)} f(y)$ is normal too! 6
Outline Single-Attribute Spatial Outlier Detection Z-value approach Iterative & Median Approach Multi-Attribute Spatial Outlier Detection Region Outlier Detection & Tracking Conclusion 7
Motivation Number of neighbors: k=3 Expected outliers: S1, S2, S3 Outliers detected by traditional approaches: S1, E1, E2 Why inconsistent? An outlier may have negative impact on its nearby points 8
Motivation of Proposed Algorithms. Objective: eliminate the negative impact of a detected spatial outlier on its nearby points (for example, S1), and find spatial outliers that are missed by traditional algorithms (for example, S2). Solutions: Iterative algorithms: each iteration detects only one spatial outlier; before a new iteration, substitute the attribute value of the previously detected spatial outlier with the average attribute value of its neighbors. Median algorithm: use the median to represent the average attribute value of the neighbors. 9
Iterative Z-value Algorithm. In each iteration: compute the standardized difference (Z-value) for every point in the dataset, $z_i = \left|\frac{d_i - \mu}{\sigma}\right|$; the point with the largest Z-value is identified as a spatial outlier; substitute the attribute value of the detected spatial outlier with the average attribute value of its neighbors. 10
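The iteration described on this slide can be sketched in Python as follows. The stopping rule (a fixed number of iterations `m`), the exclusion of already detected points from later rounds, and the toy data are assumptions; the slide itself only specifies the per-iteration steps.

```python
from statistics import mean, pstdev

def iterative_z(values, neighbors, m):
    """Return up to m outliers, one per iteration, in detection order."""
    f = dict(values)          # work on a copy; detected values get replaced
    found = []
    for _ in range(m):
        # d_i: difference between a point and its neighborhood average
        d = {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}
        diffs = list(d.values())
        mu, sigma = mean(diffs), pstdev(diffs)
        if sigma == 0:
            break
        # Assumption: points already reported are not reported again.
        z = {x: abs((d[x] - mu) / sigma) for x in f if x not in found}
        top = max(z, key=z.get)
        found.append(top)
        # Replace the outlier's value by its neighborhood average.
        f[top] = mean(f[y] for y in neighbors[top])
    return found

# Toy data (hypothetical): a strong outlier at 2 and a weaker one at 7.
values = {i: 10.0 for i in range(10)}
values[2], values[7] = 100.0, 40.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
detected = iterative_z(values, neighbors, 2)
```

In a single pass over this toy data, the two neighbors of point 2 get larger Z-values than point 7, because the value 100 inflates their neighborhood averages. After the substitution step that distortion disappears, and point 7 ranks first in the second round.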
Iterative Ratio Algorithm. In each iteration: compute the ratio of a point's attribute value to the average attribute value of its neighbors (r-value) for every point; the point with the largest r-value is identified as an outlier; substitute the attribute value of the detected spatial outlier with the average attribute value of its neighbors. 11
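A compact sketch of the ratio variant, under the assumptions that all attribute values are positive (so the ratio is defined) and that the caller fixes the number of iterations `m`; both are illustrative choices, not statements from the talk.

```python
from statistics import mean

def iterative_ratio(values, neighbors, m):
    """Return m outliers ranked by r-value, one per iteration."""
    f = dict(values)
    found = []
    for _ in range(m):
        # r-value: attribute value divided by the neighborhood average
        r = {x: f[x] / mean(f[y] for y in neighbors[x])
             for x in f if x not in found}
        top = max(r, key=r.get)
        found.append(top)
        # Replace the outlier's value by its neighborhood average.
        f[top] = mean(f[y] for y in neighbors[top])
    return found

# Toy data (hypothetical), positive values only.
values = {i: 10.0 for i in range(10)}
values[2], values[7] = 100.0, 40.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
detected_by_ratio = iterative_ratio(values, neighbors, 2)
```

As written, the largest r-value only surfaces unusually high values; an unusually low value would need the inverse ratio as well.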
Iterative Z-value vs. Ratio. Iterative Z-value: Z(s1) = 1.7, Z(s2) = 1.732, so S2 will be selected first. Iterative Ratio: Ratio(s1) = 10/1 = 10, Ratio(s2) = 170/2 = 8.5, so S1 will be selected first. 12
Median Algorithm. Use the median to represent the average attribute value of the neighbors; the median is a robust estimator for the center of a data set. Compute the Z-value for each point, $z_i = \left|\frac{d_i - \mu}{\sigma}\right|$. Select the points whose Z-value is greater than the threshold as spatial outliers. 13
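A short sketch of the median variant, assuming d_i is the difference between a point's value and the median of its neighbors' values. The threshold default and the toy neighborhoods (all points within two positions on a line) are hypothetical.

```python
from statistics import mean, median, pstdev

def median_outliers(values, neighbors, theta=2.0):
    """Flag points whose difference from the neighborhood median is extreme."""
    d = {x: values[x] - median(values[y] for y in neighbors[x]) for x in values}
    diffs = list(d.values())
    mu, sigma = mean(diffs), pstdev(diffs)
    if sigma == 0:
        return []
    return [x for x in values if abs((d[x] - mu) / sigma) > theta]

# Toy data (hypothetical): twelve points on a line, up to four neighbors each.
values = {i: 10.0 for i in range(12)}
values[5] = 100.0
neighbors = {i: [j for j in range(i - 2, i + 3) if j != i and 0 <= j < 12]
             for i in range(12)}
flagged = median_outliers(values, neighbors)
```

Because a single extreme neighbor does not move the median, the points next to the outlier keep a difference of zero here, so no substitution step is needed.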
Outline Single-Attribute Spatial Outlier Detection Multi-Attribute Spatial Outlier Detection Region Outlier Detection & Tracking Conclusions 14
Multivariate Spatial Outlier. Transportation: abnormal traffic sensor stations (volume, occupancy, speed). Astronomy: a star whose composition differs from that of neighboring stars. Census: a county whose racial composition differs from that of neighboring counties. Multivariate spatial outliers are not necessarily univariate spatial outliers; an unusual combination of normal values may cause a multivariate spatial outlier. 15
Problem Formulation: Definitions. A set of spatial points $X = \{x_1, x_2, \ldots, x_n\}$. q measurements (attribute values) are made on each spatial object x; y denotes the vector $(y_1, y_2, \ldots, y_q)^T$. $NN_k(x_i)$ denotes the k nearest spatial neighbors of $x_i$. Attribute function f: a map from X to $R^q$ (the q-dimensional Euclidean space), $y_i = f(x_i) = (f_1(x_i), f_2(x_i), \ldots, f_q(x_i))^T = (y_{i1}, y_{i2}, \ldots, y_{iq})^T$. Neighborhood function g: a map from X to $R^q$ such that the jth component $g_j(x_i)$ returns a summary statistic of the attribute values $y_j$ of all the spatial points inside $NN_k(x_i)$, for example the mean. Comparison function h: for example, $h = f - g$ or $h = f/g$. 16
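These definitions can be sketched in Python as follows. The helper names (`knn`, `neighborhood_mean`, `compare`), the Euclidean distance on 2-D coordinates, and the sample points are assumptions made for illustration.

```python
import math

def knn(coords, k):
    """NN_k: map each point id to the ids of its k nearest other points."""
    out = {}
    for p, (px, py) in coords.items():
        others = sorted((q for q in coords if q != p),
                        key=lambda q: math.hypot(coords[q][0] - px,
                                                 coords[q][1] - py))
        out[p] = others[:k]
    return out

def neighborhood_mean(attrs, nn):
    """g: component-wise mean of the attribute vectors over NN_k(x_i)."""
    q = len(next(iter(attrs.values())))
    return {p: tuple(sum(attrs[n][j] for n in nb) / len(nb) for j in range(q))
            for p, nb in nn.items()}

def compare(attrs, g, mode="diff"):
    """h = f - g (mode='diff') or h = f / g (mode='ratio')."""
    op = (lambda a, b: a - b) if mode == "diff" else (lambda a, b: a / b)
    return {p: tuple(op(a, b) for a, b in zip(attrs[p], g[p])) for p in attrs}

# Hypothetical sample: four points with q = 2 attributes each.
coords = {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (5, 5)}
attrs = {"a": (1.0, 2.0), "b": (3.0, 4.0), "c": (5.0, 6.0), "d": (7.0, 8.0)}
nn = knn(coords, 2)
g = neighborhood_mean(attrs, nn)
h = compare(attrs, g)
```

The brute-force neighbor search is quadratic in the number of points; a spatial index would be the usual replacement for large data sets.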
Mahalanobis Distance: a distance measure based on correlations between the variables. $D_t^2(x) = (X - m_t)^T S_t^{-1} (X - m_t)$. $D_t^2$ is the generalized squared distance of each point from group t; $S_t$ represents the within-group covariance matrix; $m_t$ is the vector of the means of the variables of group t; X is the vector containing the values of the variables at location x. Superior to Euclidean distance because it considers the distribution of the points (correlations). 17
Mahalanobis Distance. It takes into account not only the average value but also its variance and the covariance of the variables measured. It accounts for ranges of acceptability (variance) between variables. It compensates for interactions or dependencies (covariance) between variables. If the variables are normally distributed, the distances can be converted to probabilities using the $\chi^2$ density function. The unit of each variable influences the distance, so each variable is standardized to a mean of zero and a variance of one. 18
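A small worked example of the squared Mahalanobis distance for two variables, using the closed-form inverse of a 2x2 covariance matrix. The function name and the covariance values are hypothetical; the point is that two locations at the same Euclidean distance from the mean can have very different Mahalanobis distances.

```python
def mahalanobis_sq_2d(x, m, s):
    """Squared Mahalanobis distance (x - m)^T S^{-1} (x - m) for 2 variables.

    x, m -- (x1, x2) observation and mean vector
    s    -- ((s11, s12), (s12, s22)) symmetric covariance matrix
    """
    (a, b), (_, c) = s
    det = a * c - b * b
    dx, dy = x[0] - m[0], x[1] - m[1]
    # S^{-1} = 1/det * [[c, -b], [-b, a]]
    return (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det

# Assumed covariance: unit variances, strong positive correlation (0.9).
cov = ((1.0, 0.9), (0.9, 1.0))
along = mahalanobis_sq_2d((1.0, 1.0), (0.0, 0.0), cov)     # follows the correlation
against = mahalanobis_sq_2d((1.0, -1.0), (0.0, 0.0), cov)  # contradicts it
```

Both test points lie at Euclidean distance sqrt(2) from the mean, and each coordinate is individually unremarkable, yet the point that contradicts the correlation gets a far larger distance.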
Multivariate Spatial Outlier Detection. The q-dimensional vector h(x) follows a multivariate normal distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$. The Mahalanobis distance $d^2(x) = (h(x) - \mu)^T \Sigma^{-1} (h(x) - \mu)$ is distributed as $\chi^2_q$, the chi-square distribution with q degrees of freedom. The probability that h(x) satisfies $(h(x) - \mu)^T \Sigma^{-1} (h(x) - \mu) > \chi^2_q(\alpha)$ is $\alpha$. For a threshold $\theta$, if $d^2(x) > \theta$, x is a spatial outlier. Sample estimates: $\mu_s = \frac{1}{n}\sum_{i=1}^{n} h(x_i)$, $\Sigma_s = \frac{1}{n-1}\sum_{i=1}^{n} [h(x_i) - \mu_s][h(x_i) - \mu_s]^T$. 19
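A sketch of this test for the special case q = 2, with h = f − g and g the neighborhood mean. It is restricted to two attributes so that it needs only the standard library: for 2 degrees of freedom the upper-α chi-square quantile has the closed form −2 ln α. The function name, the toy data, and α = 0.05 are assumptions.

```python
import math

def detect_multivariate_2d(attrs, neighbors, alpha=0.05):
    """attrs: id -> (y1, y2); neighbors: id -> list of ids (NN_k)."""
    ids = list(attrs)
    n = len(ids)
    # h(x) = f(x) - g(x), with g the component-wise neighborhood mean
    h = {}
    for x in ids:
        nb = neighbors[x]
        g = [sum(attrs[y][j] for y in nb) / len(nb) for j in (0, 1)]
        h[x] = (attrs[x][0] - g[0], attrs[x][1] - g[1])
    # Sample mean vector and covariance matrix (divisor n - 1)
    mu = [sum(h[x][j] for x in ids) / n for j in (0, 1)]
    dev = {x: (h[x][0] - mu[0], h[x][1] - mu[1]) for x in ids}
    a = sum(u * u for u, _ in dev.values()) / (n - 1)
    b = sum(u * v for u, v in dev.values()) / (n - 1)
    c = sum(v * v for _, v in dev.values()) / (n - 1)
    det = a * c - b * b
    # Threshold: chi-square(2) upper-alpha quantile, closed form for q = 2
    theta = -2.0 * math.log(alpha)
    d2 = {x: (c * u * u - 2 * b * u * v + a * v * v) / det
          for x, (u, v) in dev.items()}
    return [x for x in ids if d2[x] > theta]

# Toy data (hypothetical): mild periodic variation plus one planted anomaly.
attrs = {i: (10.0 + i % 2, 20.0 + i % 3) for i in range(20)}
attrs[10] = (30.0, 5.0)
neighbors = {i: [j for j in range(i - 2, i + 3) if j != i and 0 <= j < 20]
             for i in range(20)}
multi_outliers = detect_multivariate_2d(attrs, neighbors)
```

For general q, the 2x2 closed-form inverse would be replaced by a matrix inverse and the threshold by a chi-square quantile from a statistics library.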
Experiment: Census Data Set 20
Experiment Result (Median Algorithm) 21
Experiment Result (Mean Algorithm) 22
Outline Single-Attribute Spatial Outlier Detection Multi-Attribute Spatial Outlier Detection Region Outlier Detection & Tracking Conclusions 23
Region Outlier. What is a region outlier? A group of adjoining spatial points whose feature is inconsistent with that of their surrounding neighbors. Characteristics of meteorological data: spatial region outliers are frequently associated with severe weather phenomena and climate patterns, e.g., hurricanes and tornadoes. It is preferable to decompose the original observation into different scales and treat them separately. 24
Proposed Approach. Three steps: transform the original data into the wavelet domain; reconstruct from the wavelet domain with particular scales of interest; apply image segmentation to identify region outliers. Then track the movement of the region outlier. 25
Wavelet Analysis Method. Characteristics of wavelet analysis: analyzes a signal at different frequencies with different resolutions; provides the frequency and location of a variation; data at different scales can be studied with different focus; effective for filtering a signal or splitting different scales of variation; linear time and space complexities. Applications of wavelet analysis: signal processing, image processing, computer vision; data mining (clustering, classification, regression, and data visualization). 26
Wavelet Analysis Method. Continuous wavelet transform: $W(n, s) = \sum_{i=0}^{N-1} x(i)\, \psi^{*}\!\left(\frac{(i - n)\,\delta t}{s}\right)$, where n is the localization of the wavelet transform, s is the scale, $\psi$ is the wavelet function, and $x_i$ ($i = 0, \ldots, N-1$) is a discrete signal. Inverse wavelet transform: $x_n = \frac{\delta j\, \delta t^{1/2}}{C_\delta\, \psi_0(0)} \sum_{j=0}^{J} \frac{\mathrm{Re}\{W(n, s_j)\}}{s_j^{1/2}}$, where $C_\delta$ is a constant for each wavelet function, J is the maximum scale index, and $\psi_0$ is the normalized wavelet function. 27
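The forward transform can be evaluated directly from its defining sum, as in the sketch below. The function name `cwt_direct` and the Gaussian test function are hypothetical; any scale-dependent normalization of ψ is left to the caller, and the direct sum costs O(N²) per scale, so this is for illustration rather than production use.

```python
import math

def cwt_direct(x, scales, psi, dt=1.0):
    """W[j][n] = sum_i x[i] * conj(psi((i - n) * dt / s_j)).

    x      -- list of signal samples
    scales -- list of scales s_j
    psi    -- callable wavelet function returning a float or complex
    """
    N = len(x)
    return [[sum(x[i] * psi((i - n) * dt / s).conjugate() for i in range(N))
             for n in range(N)]
            for s in scales]

# Hypothetical check: a unit impulse reproduces the wavelet shape itself.
gauss = lambda eta: math.exp(-eta * eta / 2.0)
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
W = cwt_direct(impulse, [1.0, 2.0], gauss)
```

With an impulse at index 3, each row of `W` is the (conjugated) wavelet centered at that index and stretched by the scale, which is a convenient sanity check for any implementation.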
Mexican Hat Wavelet with Locations and Scales The variation exists on all scales Power of variation changes at different locations 28
Wavelet Analysis Method. Two base functions for wavelet analysis. Mexican hat base: $\psi_0(\eta) = \frac{-1}{\sqrt{\Gamma(2 + 1/2)}} \frac{d^2}{d\eta^2}\left(e^{-\eta^2/2}\right)$. Morlet base: $\psi_0(\eta) = \pi^{-1/4} e^{i\omega_0 \eta} e^{-\eta^2/2}$. We choose the Mexican hat base: it captures both positive and negative variations as separate peaks in wavelet power and provides better localization (spatial resolution). 29
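Both base functions are easy to evaluate in closed form. In the sketch below the second derivative of the Gaussian is expanded by hand, giving (1 − η²)·exp(−η²/2)/sqrt(Γ(5/2)) for the Mexican hat. The default ω0 = 6 for the Morlet base is a commonly used value and an assumption here; the slides do not state it.

```python
import cmath
import math

def mexican_hat(eta):
    """Mexican hat: -1/sqrt(Gamma(5/2)) * d^2/d eta^2 exp(-eta^2 / 2)."""
    # d^2/d eta^2 exp(-eta^2/2) = (eta^2 - 1) * exp(-eta^2/2)
    return (1.0 - eta * eta) * math.exp(-eta * eta / 2.0) / math.sqrt(math.gamma(2.5))

def morlet(eta, omega0=6.0):
    """Morlet: pi^(-1/4) * exp(i * omega0 * eta) * exp(-eta^2 / 2)."""
    return math.pi ** -0.25 * cmath.exp(1j * omega0 * eta) * math.exp(-eta * eta / 2.0)

peak = mexican_hat(0.0)       # positive central lobe
zero = mexican_hat(1.0)       # zero crossing at |eta| = 1
side = mexican_hat(2.0)       # negative side lobe
```

The Mexican hat is real-valued with one positive lobe and two negative side lobes, while the Morlet base is complex-valued and oscillatory.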
Image Segmentation. Partitions an image into connected components; points in a specific component have uniform attribute values. Segmentation methods: Discontinuity based: segment according to abrupt changes of color intensity; often used for edge linking and curve detection. Similarity based: segment the image into regions which have similar characteristics within the boundary; for example, region growing and split-and-merge. 30
Segmentation Algorithm
Find the largest connected component:
  Find a connected component S' from the dataset.
  Compare its size with the previously detected component S.
  Use S to record the largest one.
  Repeat the above steps until all points of the dataset have been processed.
Steps to extract S' from data set Σ:
  1) Pick a point p0 from Σ whose value is greater than θ and which has not been processed yet.
  2) Label p0 as processed, and add p0 and its unprocessed neighbors into a queue.
  3) Remove a point p from the queue and check if its degree of connection C(p, p0) is greater than the variation level λ. If true, the neighbors of p are added into the queue and p is marked as processed.
  4) Repeat the marking process until the queue is empty.
31
Segmentation Algorithm
Input:  Σ: a set of data points
        θ: threshold for the clip level
        λ: variation level
Output: S: the largest connected component with value above θ

S = Ø;
while (Σ contains unlabeled points) {
    p0 = PickOneUnlabeledPoint(Σ, θ);
    L(p0) = '*';                         /* label p0 as processed */
    QUEUE = InsertQueue(QUEUE, p0);      /* insert p0 into a queue */
    while (not Empty(QUEUE)) {
        p0 = RemoveQueue(QUEUE);         /* get an element from the head of QUEUE */
        for each p that is adjacent to p0 {
            if (L(p) <> '*' and C(p, p0) ≥ 1-λ) {
                QUEUE = InsertQueue(QUEUE, p);
                L(p) = '0';
            }
        }
    }
    S' = { p : L(p) = '0' };             /* S' is a λ-connected component */
    if (S' has more points than S) S = S';   /* save the largest component to S */
}
return(S);
32
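A runnable sketch of the same breadth-first region growing on a 2-D grid with 4-connectivity. The slides do not define the degree of connection C, so the default used here (one minus the absolute value difference divided by the value range of the grid) is a hypothetical stand-in. A single visited set replaces the two labels, and the grid and parameter values are made up for illustration.

```python
from collections import deque

def largest_component(grid, theta, lam, conn=None):
    """Return the largest set of connected cells with value above theta."""
    rows, cols = len(grid), len(grid[0])
    flat = [v for row in grid for v in row]
    span = (max(flat) - min(flat)) or 1.0
    if conn is None:
        # Hypothetical degree of connection in [0, 1]; 1 means equal values.
        conn = lambda a, b: 1.0 - abs(a - b) / span
    seen, best = set(), set()
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or grid[r][c] <= theta:
                continue
            # Grow a new component from the seed (r, c).
            comp = {(r, c)}
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                pr, pc = queue.popleft()
                for qr, qc in ((pr - 1, pc), (pr + 1, pc), (pr, pc - 1), (pr, pc + 1)):
                    if not (0 <= qr < rows and 0 <= qc < cols):
                        continue
                    if (qr, qc) in seen or grid[qr][qc] <= theta:
                        continue
                    if conn(grid[qr][qc], grid[pr][pc]) >= 1.0 - lam:
                        seen.add((qr, qc))
                        comp.add((qr, qc))
                        queue.append((qr, qc))
            if len(comp) > len(best):
                best = comp   # keep the largest component found so far
    return best

# Hypothetical grid: one 4-cell region and one isolated cell above theta.
grid = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 8, 0, 7],
    [0, 0, 0, 0, 0],
]
region = largest_component(grid, theta=5, lam=0.2)
```

Marking a cell as visited when it is enqueued, rather than when it is dequeued, guarantees that each cell enters the queue at most once.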
Global Weather Data Global data of water vapor Multiple-parameter data with resolution of 1 degree by 1 degree Covers whole earth and is updated 4 times a day 33
Mexican Hat Wavelet with Locations and Scales The variation exists on all scales Power of variation changes at different locations Mexican hat wavelet has a satisfactory localization resolution 34
Wavelet transform A high value does not necessarily correspond to a high wavelet power Wavelet power mainly represents the variation of the signal for a particular scale 35
Perform Wavelet Transform along X dimension (Latitude). Include only particular scales of interest (2 and 3). Two spatial outliers: over South America (center at 27°S, 55°W): tropical storm; over the Gulf of Mexico (center at 27°N, 90°W): hurricane. 36
The Problem of Transforming along the Y-axis (Longitude). Reveals more patterns than the data reconstructed from the wavelet transform along the X-axis (latitude). These patterns are caused by the normal variation along the longitude Y and are noise in most cases. 37
Experiment: Image Segmentation. Reconstruction of water vapor at 0 AM on 9/18/2003 with Hurricane Isabel identified. Reconstruction of water vapor at 6 AM on 9/18/2003 with Hurricane Isabel identified. 38
Experiment: Tracking Movement. 12 consecutive detected Isabel regions in 3 days, with a 6-hour interval between two adjacent regions. Noisy data might exist due to other weather patterns or inappropriate segmentation parameters. Isabel moves northwestward. Figures: trajectory of the moving region with noisy data; trajectory of the moving region with noisy data removed. 39
Outline Single-Attribute Spatial Outlier Detection Multi-Attribute Spatial Outlier Detection Region Outlier Detection Conclusions 40
Summary Single Attribute Spatial Outlier Z-value, Iterative, Median Multi-Attribute Spatial Outlier Two multivariate spatial outlier detection algorithms based on difference or ratio. Order the degree of spatial outlier-ness w.r.t Mahalanobis distance Region Outlier Detection based on wavelet transform and image segmentation On-line processing approach to tracking movement of outlier region in a data stream 41
Future Directions. Multi-attribute spatial-temporal outliers. Region outliers in three-dimensional space with multiple attributes. Track multiple moving outlier regions. Remove the limitation (assumption) of multivariate normal distribution; a widely used informal method is the box plot approach. Investigate the issue of handling large disk-resident data sets: minimize the number of disk page reads or passes. 42
Related Publications. C.T. Lu, D. Chen, Y. Kou, Algorithms for Spatial Outlier Detection, IEEE International Conference on Data Mining, 2003. C.T. Lu, D. Chen, Y. Kou, Detecting Spatial Outliers with Multiple Attributes, IEEE International Conference on Tools with Artificial Intelligence, 2003. J. Zhao, C.T. Lu, Y. Kou, Detecting Region Outliers in Meteorological Data, Proceedings of the 11th International Symposium on Advances in Geographic Information Systems, New Orleans, Louisiana, pp. 49-55, Nov. 7-8, 2003. 43
Links Mapview: http://europa.nvc.cs.vt.edu/~ctlu/project/mapview/index.htm Mapcube: http://europa.nvc.cs.vt.edu/~ctlu/project/mapcube/mapcube.htm 44
Q & A ctlu@vt.edu 45