Spatial Outlier Detection

Chang-Tien Lu
Department of Computer Science, Northern Virginia Center, Virginia Tech
Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao

Spatial Outlier
- A spatial data point that is extreme relative to its neighbors

Outline
- Single-Attribute Spatial Outlier Detection
  - Z-value approach
  - Iterative & Median Approach
- Multi-Attribute Spatial Outlier Detection
- Region Outlier Detection & Tracking
- Conclusion

An Example of Spatial Outlier
- Spatial outlier: S
- Global outliers: G, L

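Spatial Outlier Detection: $Z_{S(x)}$ approach
- Function: $S(x) = f(x) - \frac{1}{k} \sum_{y \in N(x)} f(y)$, the difference between the attribute value of x and the average attribute value of its k neighbors N(x)
- If $Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$, declare x as a spatial outlier, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of S(x) over all points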
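The Z-value test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `z_value_outliers`, the representation of N(x) as index lists, the use of the population standard deviation, and the threshold θ = 2 are choices made here, not taken from the slides.

```python
from statistics import mean, pstdev

def z_value_outliers(values, neighbors, theta=2.0):
    """Return indices i whose standardized S(x_i) exceeds theta.

    values:    attribute values f(x), one per point
    neighbors: neighbors[i] is the list of indices forming N(x_i)
    """
    # S(x) = f(x) - average of f over the neighbors of x
    s = [values[i] - mean(values[j] for j in neighbors[i])
         for i in range(len(values))]
    mu, sigma = mean(s), pstdev(s)
    if sigma == 0:
        return []  # every point matches its neighborhood equally well
    return [i for i, si in enumerate(s) if abs((si - mu) / sigma) > theta]

# Hypothetical 1-D example: each point's neighbors are the adjacent points.
values = [10, 10, 10, 10, 100, 10, 30, 10, 10, 10, 10, 10]
n = len(values)
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
print(z_value_outliers(values, neighbors))  # -> [4]
```

In this toy data only the point with value 100 is flagged. The milder anomaly at index 6 is missed because the large value at index 4 inflates the spread of S(x).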

Evaluation of Statistical Assumption
- The distribution of the traffic station attribute f(x) is normal
- The distribution of $S(x) = f(x) - \frac{1}{k} \sum_{y \in N(x)} f(y)$ is normal too

Outline
- Single-Attribute Spatial Outlier Detection
  - Z-value approach
  - Iterative & Median Approach
- Multi-Attribute Spatial Outlier Detection
- Region Outlier Detection & Tracking
- Conclusion

Motivation
- Number of neighbors: k = 3
- Expected outliers: S1, S2, S3
- Outliers detected by traditional approaches: S1, E1, E2
- Why inconsistent? An outlier may have a negative impact on its nearby points

Motivation of Proposed Algorithms
- Objectives
  - Eliminate the negative impact of a detected spatial outlier on its nearby points, for example S1
  - Find spatial outliers that would be ignored by traditional algorithms, for example S2
- Solutions
  - Iterative algorithms
    - Each iteration detects only one spatial outlier
    - Before a new iteration, substitute the attribute value of the previously detected spatial outlier with the average attribute value of its neighbors
  - Median algorithm
    - Use the median to represent the average attribute value of the neighbors

Iterative Z-value Algorithm
- In each iteration:
  - Compute the standardized difference (Z-value) for every point in the dataset: $z_i = \left| \frac{d_i - \mu}{\sigma} \right|$, where $d_i$ is the difference between the attribute value of point i and the average attribute value of its neighbors
  - Identify the point with the largest Z-value as a spatial outlier
  - Substitute the attribute value of the detected spatial outlier with the average attribute value of its neighbors
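A possible Python sketch of the iterative Z-value loop follows. The slide does not state a stopping rule, so stopping when the largest Z-value no longer exceeds a threshold `theta` is an assumption, as are the function name and the index-list representation of neighborhoods.

```python
from statistics import mean, pstdev

def iterative_z_outliers(values, neighbors, theta=2.0):
    """Detect one spatial outlier per iteration, then repair its value."""
    vals = list(values)  # work on a copy so the input is not modified
    n = len(vals)
    found = []
    while len(found) < n:
        nbr_avg = [mean(vals[j] for j in neighbors[i]) for i in range(n)]
        d = [vals[i] - nbr_avg[i] for i in range(n)]
        mu, sigma = mean(d), pstdev(d)
        if sigma == 0:
            break
        z = [abs((di - mu) / sigma) for di in d]
        candidates = [i for i in range(n) if i not in found]
        best = max(candidates, key=lambda i: z[i])
        if z[best] <= theta:  # assumed stopping rule, not from the slides
            break
        found.append(best)
        # Replace the outlier by its neighborhood average so that it no
        # longer distorts the differences of nearby points.
        vals[best] = nbr_avg[best]
    return found

# Hypothetical 1-D example: each point's neighbors are the adjacent points.
values = [10, 10, 10, 10, 100, 10, 30, 10, 10, 10, 10, 10]
n = len(values)
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
print(iterative_z_outliers(values, neighbors))  # -> [4, 6]
```

On this toy data the second iteration also finds the point with value 30, which a single pass with the same threshold would miss because the value 100 dominates the spread.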

Iterative Ratio Algorithm
- In each iteration:
  - Compute, for every point, the ratio of the point's attribute value to the average attribute value of its neighbors (r-value)
  - Identify the point with the largest r-value as an outlier
  - Substitute the attribute value of the detected spatial outlier with the average attribute value of its neighbors
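The ratio variant can be sketched the same way. Assumptions made here and not stated on the slide: attribute values are positive, the number of outliers to report (`m`) is given by the caller, and points whose neighborhood average is zero are skipped.

```python
from statistics import mean

def iterative_ratio_outliers(values, neighbors, m):
    """Report up to m outliers, one per iteration, using the r-value."""
    vals = list(values)  # work on a copy so the input is not modified
    n = len(vals)
    found = []
    for _ in range(m):
        nbr_avg = [mean(vals[j] for j in neighbors[i]) for i in range(n)]
        candidates = [i for i in range(n)
                      if i not in found and nbr_avg[i] != 0]
        if not candidates:
            break
        # r-value: attribute value divided by the neighborhood average
        best = max(candidates, key=lambda i: vals[i] / nbr_avg[i])
        found.append(best)
        vals[best] = nbr_avg[best]  # repair before the next iteration
    return found

# Hypothetical 1-D example: each point's neighbors are the adjacent points.
values = [10, 10, 10, 10, 100, 10, 30, 10, 10, 10, 10, 10]
n = len(values)
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
print(iterative_ratio_outliers(values, neighbors, m=2))  # -> [4, 6]
```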

Iterative Z-value vs. Ratio
- Iterative Z-value
  - Z(s1) = 1.7
  - Z(s2) = 1.732
  - S2 will be selected first
- Iterative Ratio
  - Ratio(s1) = 10/1 = 10
  - Ratio(s2) = 8.5
  - S1 will be selected first

Median Algorithm
- Use the median to represent the average attribute value of the neighbors
- The median is a robust estimator for the center of a data set
- Compute the Z-value for each point: $z_i = \left| \frac{d_i - \mu}{\sigma} \right|$
- Select the points whose Z-value is greater than the threshold as spatial outliers
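A short Python sketch of the median variant, under the same illustrative assumptions as before (index-list neighborhoods, population standard deviation, θ = 2). The only change from the Z-value test is that the neighborhood summary is the median rather than the mean.

```python
from statistics import mean, median, pstdev

def median_outliers(values, neighbors, theta=2.0):
    """Flag points whose standardized difference from the neighbor median exceeds theta."""
    # d_i = f(x_i) - median of f over the neighbors of x_i
    d = [values[i] - median(values[j] for j in neighbors[i])
         for i in range(len(values))]
    mu, sigma = mean(d), pstdev(d)
    if sigma == 0:
        return []
    return [i for i, di in enumerate(d) if abs((di - mu) / sigma) > theta]

# Hypothetical 1-D example: neighbors are the points within two positions.
values = [10, 11, 10, 12, 50, 11, 10, 12, 11, 10]
n = len(values)
neighbors = [[j for j in range(max(0, i - 2), min(n, i + 3)) if j != i]
             for i in range(n)]
print(median_outliers(values, neighbors))  # -> [4]
```

Because the median ignores a single extreme neighbor, the points next to index 4 keep small differences, so a single pass suffices in this example.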

Outline
- Single-Attribute Spatial Outlier Detection
- Multi-Attribute Spatial Outlier Detection
- Region Outlier Detection & Tracking
- Conclusions

Multivariate Spatial Outlier
- Transportation: abnormal traffic sensor stations (volume, occupancy, speed)
- Astronomy: a star whose constituents differ from those of neighboring stars
- Census: a county whose racial composition is dissimilar to that of neighboring counties
- Multivariate spatial outliers are not necessarily univariate spatial outliers
- An unusual combination of normal values may cause a multivariate spatial outlier

Problem Formulation: Definitions
- A set of spatial points $X = \{x_1, x_2, \ldots, x_n\}$
- q measurements (attribute values) are made on each spatial object x; y denotes the vector $(y_1, y_2, \ldots, y_q)^T$
- $NN_k(x_i)$ denotes the k nearest spatial neighbors of $x_i$
- Attribute function f: a map from X to $R^q$ (the q-dimensional Euclidean space), $y_i = f(x_i) = (f_1(x_i), f_2(x_i), \ldots, f_q(x_i))^T = (y_{i1}, y_{i2}, \ldots, y_{iq})^T$
- Neighborhood function g: a map from X to $R^q$ such that the j-th component $g_j(x_i)$ returns a summary statistic of the attribute values $y_j$ of all the spatial points inside $NN_k(x_i)$, for example the mean
- Comparison function h: for example, $h = f - g$ or $h = f / g$

Mahalanobis Distance
- A distance measure based on correlations between the variables
- $D_t^2(x) = (X - m_t)^T S_t^{-1} (X - m_t)$
  - $D_t^2$ is the generalized squared distance of each point from the group t
  - $S_t$ is the within-group covariance matrix
  - $m_t$ is the vector of the means of the variables of the group t
  - X is the vector containing the values of the variables at location x
- Superior to Euclidean distance because it considers the distribution of the points (correlations)

Mahalanobis Distance
- It takes into account not only the average value but also the variance and the covariance of the variables measured
- It accounts for ranges of acceptability (variance) between variables
- It compensates for interactions or dependencies (covariance) between variables
- If the variables are normally distributed, the distances can be converted to probabilities using the $\chi^2$ density function
- The unit of a variable has an influence on the distance
- Each variable is standardized to a mean of zero and a variance of one

Multivariate Spatial Outlier Detection
- The q-dimensional vector h(x) follows a multivariate normal distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$
- The Mahalanobis distance $d^2(x) = (h(x) - \mu)^T \Sigma^{-1} (h(x) - \mu)$ is distributed as $\chi^2_q$, the chi-square distribution with q degrees of freedom
- The probability that h(x) satisfies $(h(x) - \mu)^T \Sigma^{-1} (h(x) - \mu) > \chi^2_q(\alpha)$ is $\alpha$
- For a threshold $\theta$, if $d^2(x) > \theta$, x is a spatial outlier
- Sample estimates: $\mu_s = \frac{1}{n} \sum_{i=1}^{n} h(x_i)$ and $\Sigma_s = \frac{1}{n-1} \sum_{i=1}^{n} [h(x_i) - \mu_s][h(x_i) - \mu_s]^T$
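As a rough illustration of this test, the Python sketch below handles the special case q = 2 so that the covariance matrix can be inverted by hand. Assumptions made here: h = f − g with g the component-wise neighbor mean, the sample covariance uses the 1/(n − 1) factor, and the threshold is the chi-square quantile for 2 degrees of freedom, which has the closed form −2 ln α. The function name and the ring-shaped toy data are hypothetical.

```python
import math

def mahalanobis_outliers(attrs, neighbors, alpha=0.05):
    """attrs: list of (a, b) attribute pairs; returns (flagged indices, d^2 list)."""
    n = len(attrs)
    h = []
    for i in range(n):
        k = len(neighbors[i])
        g0 = sum(attrs[j][0] for j in neighbors[i]) / k
        g1 = sum(attrs[j][1] for j in neighbors[i]) / k
        h.append((attrs[i][0] - g0, attrs[i][1] - g1))  # h = f - g
    m0 = sum(v[0] for v in h) / n
    m1 = sum(v[1] for v in h) / n
    # Entries of the 2x2 sample covariance matrix
    s00 = sum((v[0] - m0) ** 2 for v in h) / (n - 1)
    s11 = sum((v[1] - m1) ** 2 for v in h) / (n - 1)
    s01 = sum((v[0] - m0) * (v[1] - m1) for v in h) / (n - 1)
    det = s00 * s11 - s01 * s01
    if det == 0:
        raise ValueError("singular covariance matrix")
    d2 = []
    for v in h:
        a, b = v[0] - m0, v[1] - m1
        # (a, b) Sigma^{-1} (a, b)^T using the closed-form 2x2 inverse
        d2.append((s11 * a * a - 2 * s01 * a * b + s00 * b * b) / det)
    theta = -2.0 * math.log(alpha)  # chi-square quantile for q = 2 only
    return [i for i in range(n) if d2[i] > theta], d2

# Hypothetical data: 12 points on a ring, index 5 has unusual values.
a = [0, 1] * 6
b = [0, 0, 1, 1] * 3
a[5], b[5] = 11, 10
attrs = list(zip(a, b))
n = len(attrs)
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
flagged, d2 = mahalanobis_outliers(attrs, neighbors)
print(flagged)  # -> [5]
```

For more than two attributes the same steps apply, with a general matrix inverse in place of the closed-form 2x2 expression and the corresponding chi-square quantile as the threshold.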

Experiment: Census Data Set

Experiment Result (Median Algorithm)

Experiment Result (Mean Algorithm)

Outline
- Single-Attribute Spatial Outlier Detection
- Multi-Attribute Spatial Outlier Detection
- Region Outlier Detection & Tracking
- Conclusions

Region Outlier
- What is a region outlier?
  - A group of adjoining spatial points whose feature is inconsistent with that of their surrounding neighbors
- Characteristics of meteorological data
  - Spatial region outliers are frequently associated with severe weather phenomena and climate patterns, e.g., hurricanes and tornadoes
  - It is preferable to decompose the original observations into different scales and treat them separately

Proposed Approach
- Three steps
  - Transform the original data into the wavelet domain
  - Reconstruct from the wavelet domain with the particular scales of interest
  - Apply image segmentation to identify region outliers
- Track the movement of the region outlier

Wavelet Analysis Method
- Characteristics of wavelet analysis
  - Analyzes a signal at different frequencies with different resolutions
  - Provides the frequency and location of a variation
  - Data at different scales can be studied with different focus
  - Effective for filtering a signal or splitting different scales of variation
  - Linear time and space complexities
- Applications of wavelet analysis
  - Signal processing, image processing, computer vision
  - Data mining: clustering, classification, regression, and data visualization

Wavelet Analysis Method
- Continuous wavelet transform: $W(n, s) = \sum_{i=0}^{N-1} x_i \, \psi^{*}\left[ \frac{(i - n)\, \delta t}{s} \right]$
  - n: localization of the wavelet transform
  - s: scale
  - $\psi$: wavelet function
  - $x_i$ (i = 0, ..., N-1): a discrete signal
- Inverse wavelet transform: $x_n = \frac{\delta j \, \delta t^{1/2}}{C_\delta \, \psi_0(0)} \sum_{j=0}^{J} \frac{\mathrm{Re}\{W(n, s_j)\}}{s_j^{1/2}}$
  - $C_\delta$: a constant for each wavelet function
  - J: maximum scale index
  - $\psi_0$: normalized wavelet function

Mexican Hat Wavelet with Locations and Scales
- The variation exists on all scales
- The power of the variation changes at different locations

Wavelet Analysis Method
- Two base functions for wavelet analysis
  - Mexican hat base: $\psi_0(\eta) = \frac{-1}{\sqrt{\Gamma(2 + 1/2)}} \frac{d^2}{d\eta^2} \left( e^{-\eta^2/2} \right)$
  - Morlet base: $\psi_0(\eta) = \pi^{-1/4} e^{i \omega_0 \eta} e^{-\eta^2/2}$
- We choose the Mexican hat base
  - It captures both positive and negative variations as separate peaks in wavelet power
  - It provides better localization (spatial resolution)
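To make the Mexican hat base concrete, here is a small pure-Python sketch. It evaluates the second derivative in closed form, so the base becomes (1 − η²) e^{−η²/2} / sqrt(Γ(2.5)), and applies the wavelet to a signal by direct summation over all samples. The function names are hypothetical, no extra scale normalization is applied, and the direct sum costs O(N²) per scale, so it is meant for illustration rather than for full-resolution global grids.

```python
import math

def mexican_hat(eta):
    """Mexican hat base: -(d^2/d eta^2) exp(-eta^2 / 2) / sqrt(Gamma(2.5))."""
    return (1.0 - eta * eta) * math.exp(-eta * eta / 2.0) / math.sqrt(math.gamma(2.5))

def cwt(signal, scales, dt=1.0):
    """Return W[k][n], the coefficient at scale scales[k] and location n."""
    N = len(signal)
    # The Mexican hat is real, so its complex conjugate is the function itself.
    return [[sum(signal[i] * mexican_hat((i - n) * dt / s) for i in range(N))
             for n in range(N)]
            for s in scales]

# Hypothetical signal: a single spike in the middle of a flat series.
signal = [0.0] * 21
signal[10] = 1.0
coeffs = cwt(signal, scales=[2.0])
power = [w * w for w in coeffs[0]]
print(power.index(max(power)))  # location of the strongest variation -> 10
```

The wavelet power peaks at the location of the spike, which is the localization property the slide refers to.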

Image Segmentation
- Image segmentation
  - Partitions an image into connected components
  - Points in a specific component have uniform attribute values
- Segmentation methods
  - Discontinuity based
    - Segment according to an abrupt change of color intensity
    - Often used for edge linking and curve detection
  - Similarity based
    - Segment the image into regions that have similar characteristics within the boundary
    - For example, region growing and split-and-merge

Segmentation Algorithm
- Find the largest connected component
  - Find a connected component S' from the dataset
  - Compare its size with the previously detected component S
  - Use S to record the larger one
  - Repeat the above steps until all points of the dataset have been processed
- Steps to extract S' from the data set Σ
  1) Pick a point p0 from Σ whose value is greater than θ and that has not been processed yet.
  2) Label p0 as processed, and add p0 and its unprocessed neighbors to a queue.
  3) Remove a point p from the queue and check whether its degree of connection C(p, p0) is greater than the variation level λ. If so, add the neighbors of p to the queue and mark p as processed.
  4) Repeat the marking process until the queue is empty.

Segmentation Algorithm
Input:
  Σ: a set of data points
  θ: threshold for the clip level
  λ: variation level
Output:
  S: the largest connected component with value above θ

S = Ø;
while (Σ contains unlabeled points) {
    p0 = pickOneUnlabeledPoint(Σ, θ);
    L(p0) = '*';                          /* label p0 as processed */
    QUEUE = InsertQueue(QUEUE, p0);       /* insert p0 into a queue */
    while (not Empty(QUEUE)) {
        p0 = RemoveQueue(QUEUE);          /* get an element from the head of QUEUE */
        for each p that is adjacent to p0 {
            if (L(p) <> '*' and C(p, p0) >= 1-λ) {
                QUEUE = InsertQueue(QUEUE, p);
                L(p) = '*';
            }
        }
    }
    S' = { p : L(p) = '*' in this pass }; /* S' is a λ-connected component */
    if (S' has more points than S)
        S = S';                           /* save the largest component to S */
}
return (S);
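The queue-based region growing can be sketched in Python on a small 2-D grid. Several details are assumptions made for illustration, since the slides do not define them: points are grid cells with 4-neighbor adjacency, the degree of connection is taken as C(p, q) = 1 − |v(p) − v(q)| / value_range, the acceptance test is read as C(p, p0) ≥ 1 − λ, and only the seed of each component is required to exceed θ.

```python
from collections import deque

def largest_component(grid, theta, lam, value_range):
    """Return the largest region grown from seeds above theta, as a set of (row, col)."""
    rows, cols = len(grid), len(grid[0])
    labeled = [[False] * cols for _ in range(rows)]
    best = set()
    for r in range(rows):
        for c in range(cols):
            if labeled[r][c] or grid[r][c] <= theta:
                continue  # not a valid seed
            labeled[r][c] = True
            component = {(r, c)}
            queue = deque([(r, c)])
            while queue:
                pr, pc = queue.popleft()
                for nr, nc in ((pr - 1, pc), (pr + 1, pc), (pr, pc - 1), (pr, pc + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and not labeled[nr][nc]:
                        # Assumed degree of connection between adjacent cells
                        conn = 1.0 - abs(grid[nr][nc] - grid[pr][pc]) / value_range
                        if conn >= 1.0 - lam:
                            labeled[nr][nc] = True
                            component.add((nr, nc))
                            queue.append((nr, nc))
            if len(component) > len(best):
                best = component  # keep the largest component so far
    return best

# Hypothetical 4 x 5 field with two high-valued regions.
grid = [
    [1, 1, 1, 1, 1],
    [1, 9, 9, 1, 1],
    [1, 9, 8, 1, 7],
    [1, 1, 1, 1, 7],
]
print(sorted(largest_component(grid, theta=5, lam=0.2, value_range=10)))
# -> [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Each cell is labeled at most once, so the running time is linear in the number of cells.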

Global Weather Data
- Global data of water vapor
- Multiple-parameter data with a resolution of 1 degree by 1 degree
- Covers the whole earth and is updated 4 times a day

Mexican Hat Wavelet with Locations and Scales
- The variation exists on all scales
- The power of the variation changes at different locations
- The Mexican hat wavelet has a satisfactory localization resolution

Wavelet Transform
- A high value does not necessarily correspond to a high wavelet power
- Wavelet power mainly represents the variation of the signal at a particular scale

Perform Wavelet Transform along X Dimension (Latitude)
- Include only the particular scales of interest (2 and 3)
- Two spatial outliers
  - Over South America (centered at 27° S, 55° W): tropical storm
  - Over the Gulf of Mexico (centered at 27° N, 90° W): hurricane

The Problem of Transforming along the Y-axis (Longitude)
- Reveals more patterns than the data reconstructed from the wavelet transform along the X-axis (latitude)
- These patterns are caused by the normal variation along the longitude Y and are noise in most cases

Experiment: Image Segmentation
- Reconstruction of water vapor at 0 AM on 9/18/2003, with Hurricane Isabel identified
- Reconstruction of water vapor at 6 AM on 9/18/2003, with Hurricane Isabel identified

Experiment: Tracking Movement
- 12 consecutive detected Isabel regions in 3 days
- 6-hour interval between two adjacent regions
- Noisy data might exist due to other weather patterns or inappropriate segmentation parameters
- Isabel moves northwestward
- Figures: trajectory of the moving region with noisy data; trajectory of the moving region with noisy data removed

Outline
- Single-Attribute Spatial Outlier Detection
- Multi-Attribute Spatial Outlier Detection
- Region Outlier Detection
- Conclusions

Summary
- Single-attribute spatial outlier
  - Z-value, iterative, median
- Multi-attribute spatial outlier
  - Two multivariate spatial outlier detection algorithms based on difference or ratio
  - Order the degree of spatial outlier-ness w.r.t. Mahalanobis distance
- Region outlier
  - Detection based on wavelet transform and image segmentation
  - On-line processing approach to tracking the movement of an outlier region in a data stream

Future Directions
- Multi-attribute spatial-temporal outliers
  - Region outliers in three-dimensional space with multiple attributes
  - Track multiple moving outlier regions
- Remove the limitation (assumption) of multivariate normal distribution
  - Widely used informal method: box plot approach
- Investigate the issue of handling large disk-resident data sets
  - Minimize the number of disk page reads or passes

Related Publications
- C.T. Lu, D. Chen, Y. Kou, Algorithms for Spatial Outlier Detection, IEEE International Conference on Data Mining, 2003.
- C.T. Lu, D. Chen, Y. Kou, Detecting Spatial Outliers with Multiple Attributes, IEEE International Conference on Tools with Artificial Intelligence, 2003.
- J. Zhao, C.T. Lu, Y. Kou, Detecting Region Outliers in Meteorological Data, Proceedings of the 11th International Symposium on Advances in Geographic Information Systems, New Orleans, Louisiana, pp. 49-55, Nov. 7-8, 2003.

Links
- Mapview: http://europa.nvc.cs.vt.edu/~ctlu/project/mapview/index.htm
- Mapcube: http://europa.nvc.cs.vt.edu/~ctlu/project/mapcube/mapcube.htm

Q & A
ctlu@vt.edu