Unified approach to detecting spatial outliers Shashi Shekhar, Chang-Tien Lu And Pusheng Zhang. Pekka Maksimainen University of Helsinki 2007

Similar documents
Traffic Volume(Time v.s. Station)

DETECTION OF ANOMALIES FROM DATASET USING DISTRIBUTED METHODS

Spatial Outlier Detection

/4 Directions: Graph the functions, then answer the following question.

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

Algebra 1, 4th 4.5 weeks

Quickstart for Web and Tablet App

Integrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

TI- Nspire Testing Instructions

Quickstart for Desktop Version

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

Polymath 6. Overview

Test Name: Chapter 3 Review

This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight

Introduction. Product List. Design and Functionality 1/10/2013. GIS Seminar Series 2012 Division of Spatial Information Science

6.1 The Rectangular Coordinate System

The following sections provide step by step guidance on using the Toe Extractor. 2.1 Creating an Instance of the Toe Extractor Point Cloud Task

Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8

Improvements in Continuous Variable Simulation with Multiple Point Statistics

Sec 4.1 Coordinates and Scatter Plots. Coordinate Plane: Formed by two real number lines that intersect at a right angle.

ALGEBRA II A CURRICULUM OUTLINE

STA 570 Spring Lecture 5 Tuesday, Feb 1

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation

Rational Numbers and the Coordinate Plane

ECE 176 Digital Image Processing Handout #14 Pamela Cosman 4/29/05 TEXTURE ANALYSIS

Exploratory Data Analysis EDA

Polygons in the Coordinate Plane

REVIEW, pages

Year 10 General Mathematics Unit 2

Excel Primer CH141 Fall, 2017

Middle School Math Course 3

You should be able to plot points on the coordinate axis. You should know that the the midpoint of the line segment joining (x, y 1 1

Geostatistics 2D GMS 7.0 TUTORIALS. 1 Introduction. 1.1 Contents

Time Series Analysis DM 2 / A.A

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Interactive Math Glossary Terms and Definitions

Unsupervised Learning and Clustering

LINEAR FUNCTIONS, SLOPE, AND CCSS-M

Chapter 25: Spatial and Temporal Data and Mobility

Unit Maps: Grade 8 Math

Storage and Access Methods for Advanced Traveler Information Systems ... Report 9626 S~1PT O TANSPORTAT57ON E.S. Number 96-U CTS TE

Chapters 1 7: Overview

Mathematics Scope & Sequence Grade 8 Revised: June 2015

Mineração de Dados Aplicada

Unsupervised Learning and Clustering

Segmentation and Grouping

Section Graphs and Lines

Multi-level Partition of Unity Implicits

Computer Vision Group Prof. Daniel Cremers. 6. Boosting

Course of study- Algebra Introduction: Algebra 1-2 is a course offered in the Mathematics Department. The course will be primarily taken by

Section 7D Systems of Linear Equations

Multi-Class Segmentation with Relative Location Prior

Basic Calculator Functions

Dealing with Data in Excel 2013/2016

1. Determine the population mean of x denoted m x. Ans. 10 from bottom bell curve.

Prentice Hall Mathematics: Course Correlated to: Massachusetts State Learning Standards Curriculum Frameworks (Grades 7-8)

8. Automatic Content Analysis

Things to Know for the Algebra I Regents

Data Mining Clustering

Data Analysis Guidelines

Each time a file is opened, assign it one of several access patterns, and use that pattern to derive a buffer management policy.

Table of Contents (As covered from textbook)

University of Florida CISE department Gator Engineering. Clustering Part 4

Multidimensional Data and Modelling - DBMS

Edge and local feature detection - 2. Importance of edge detection in computer vision

Generative and discriminative classification techniques

Getting to Know Your Data

Spatial Interpolation & Geostatistics

INDEPENDENT SCHOOL DISTRICT 196 Rosemount, Minnesota Educating our students to reach their full potential

Integrated Mathematics I Performance Level Descriptors

Grade 9 Math Terminology

Distance-based Methods: Drawbacks

Spatial Analysis and Modeling (GIST 4302/5302) Guofeng Cao Department of Geosciences Texas Tech University

Performance Level Descriptors. Mathematics

A-SSE.1.1, A-SSE.1.2-

Chapter 3. Image Processing Methods. (c) 2008 Prof. Dr. Michael M. Richter, Universität Kaiserslautern

C E N T E R A T H O U S T O N S C H O O L of H E A L T H I N F O R M A T I O N S C I E N C E S. Image Operations I

Clustering Part 4 DBSCAN

Regression III: Advanced Methods

Unit Maps: Grade 8 Math

Clustering in Data Mining

Three-Dimensional (Surface) Plots

MATH ALGEBRA AND FUNCTIONS 5 Performance Objective Task Analysis Benchmarks/Assessment Students:

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Topic 4 Image Segmentation

Spatial Interpolation - Geostatistics 4/3/2018

Minnesota Academic Standards for Mathematics 2007

Exploring and Understanding Data Using R.

DIGITAL IMAGE ANALYSIS. Image Classification: Object-based Classification

Slide 1 / 96. Linear Relations and Functions

Beyond Competent (In addition to C)

Statistics: Normal Distribution, Sampling, Function Fitting & Regression Analysis (Grade 12) *

EXCEL 98 TUTORIAL Chemistry C2407 fall 1998 Andy Eng, Columbia University 1998

Cluster Analysis for Microarray Data

Homework 1 Excel Basics

Name Class Date. Using Graphs to Relate Two Quantities

+ = Spatial Analysis of Raster Data. 2 =Fault in shale 3 = Fault in limestone 4 = no Fault, shale 5 = no Fault, limestone. 2 = fault 4 = no fault

Data Analyst Nanodegree Syllabus

Hands-On Standards Deluxe Grades: 7, 8 States: California Content Standards

Transcription:

Unified approach to detecting spatial outliers Shashi Shekhar, Chang-Tien Lu And Pusheng Zhang Pekka Maksimainen University of Helsinki 2007

Spatial outlier Outlier Inconsistent observation in data set Spatial outlier Inconsistent attribute value in spatially referenced object Spatial values (location, shape,...) are not of importance Local instability Extreme attribute values compared to neighbors page 2

Application domains Transportation, ecology, public safety, public health, climatology, location based services,... Minnesota Department of Transportation Traffic Management Center Freeway Operations group traffic measurements 900 sensor stations Attributes Volume of traffic on the roads Occupancy Sensor ID. page 3

Traffic data Spatial attribute Sensor location S = {s1, s2, s3,..., sn} <Highway, milepoint> Directed graph indicating road between two sensor locations (eg. s1 s2) Attribute data Traffic volume, occupancy of road We are interested in finding locations which are different than their neighbors that is outliers! page 4

Measuring outlierness Set of definitions f(x) attribute value for location x N(x) neighborhood of location x E y N x f y average attribute value for neighbors of x S x =[ f x E y N x f y ] difference of x's attribute value to it's neighbors For normally distributed f(x) we can measure outlierness by Z S x = S x S / s μs is the mean value of S(x) σs is the standard deviation of S(x) Choice for θ specifies confidence level Confidence level of 95% ~ θ = 2 page 5

Measuring outlierness (2) More general definition for outlierness N f aggr :ℝ N ℝ aggregate function for the values of f over neighborhood F diff : ℝ ℝ ℝ difference function ST : ℝ {True, False} statistical test for significance For finding outliers we can define above functions in different fashion to find outliers. In previous slide we defined aggregate function to average attribute value of node x's neighbors. Difference function was the arithmetic difference between f(x) and aggregate value. Statistical significance was defined with the help of mean and standard deviation. N Object O is an S-outlier f, f aggr, F diff, ST if N ST { F diff [ f x, f aggr f x, N x ] } is true page 6

Measuring outlierness (3) DB(p, D)-outlier (distance based) Statistical significance measure p (fraction of nodes) N objects in set T Object O is a DB(p, D)-outlier if atleast fraction p of the objects in T lie greater than distance D from O N f Let aggr be the number of objects within the distance D from object O Statistical test function can be defined as N N f aggr x / N p page 7

Related methods Non-spatial methods are not fit for detecting spatial outliers Node G is an outlier because it's attribute value exceeds the threshold on normal distribution limits Spatial location is not considered page 8

Related methods (2) Outlier detection method categories Homogeneous methods don't differentiate between attribute dimensions and spatial dimensions Homogeneous methods use all dimensions for defining neighborhood as well as for comparison page 9

Related methods (3) Bi-partite multi-dimensional tests are designed to detect spatial outliers Spatial attributes characterize location, neighborhood and distance Non-spatial attributes are used to compare object to its neighbors Two kinds of bi-partite multi-dimensional tests Graphical tests Visualization of spatial data which highlights spatial outliers Quantitative tests Precise test to distinguish spatial outliers page 10

Variogram-cloud Variogram-cloud displays objects related by neighborhood relationships For each pair of locations plot the following values Square root of the absolute difference between attribute values Distance between the locations Locations that are near to eachother but large attribute differences might indicate spatial outlier Point S can be identified as spatial outlier page 11

Moran scatterplot Moran scatterplot is a plot of normalized attribute value against the neighborhood average of normalized attribute values Z [ f i = f i f / f ] Upper left and lower right quadrants indicate spatial association of dissimilar values Points P and Q are surrounded by high value neighbors Point S is surrounded by low value neighbors Spatial outliers page 12

Scatterplot Scatterplot shows attribute values on the X-axis and the average of the attribute values in the neighborhood on the Y-axis Best fit regression line is used to identify spatial outliers Positive autocorrelation Scatter regression slopes to the right Negative autocorrelation Scatter regression slopes to the left Vertical difference of a data point tells about outlierness page 13

Spatial statistic ZS(x) test Spatial statistic test shows the location of data points in 1-D space on X-axis and statistic test values for each data point on Y-axis Point S has ZS(x) value exceeding 3 and will be detected as spatial outlier Because S is an outlier the neighboring data points P and Q have values close to -2 page 14

Spatial outlier detection problem Objective is to design a computationally efficient algorithm to detect S-outliers Previously introduced functions and measumerements are used (aggregates, difference functions, neighborhoods,...) Constraints The size of data is greater than main memory size Computation time is determined by I/O time page 15

Model building Model building: efficient computation method to compute the global statistical parameters using a spatial join Distributive aggregate functions min, max, sum, count,... Algebraic aggregate functions mean, standard deviation,... These values can be computed by single scanning of the data set I/O reads Algebraic aggregate functions can be used by difference function Fdiff and statistical test function ST page 16

Model building algorithm page 17

Effectiveness of model building Efficiency depends greatly on I/O Most time consuming process in model building algorithm is the method Find_Neighbor_Nodes_Set() If neighboring nodes are not in memory then extra I/O read must be done Idea: try to cluster each node with its neighbors to same disk page Clustering efficiency Practically CE defines the execution time page 18

Clustering efficiency To get neighbors of node v1 pages A and B must be read. Page A however is already in memory because v1 was read from there. For node v3 no extra reads are needed CE = 6 CE= =0.67 9 Total number of unsplit edges Total number of edges page 19

Route outlier detection Route outlier detection (ROD) detects outliers on the user given route ROD retrieves the neighboring nodes for each node in given route RN N Compute neighborhood aggregate function F aggr Difference function Fdiff is computed using the attribute function f(x), neighborhood aggregate function and the algebraic aggregate functions computed in the model building algorithm Test node x using the statistical test function ST page 20

Route outlier detection (ROD) algorithm page 21

Clustering methods Connectivity-clustered access method (CCAM) Cluster the nodes via graph partitioning Graph partitioning methods Secondary index to support query operations B+-tree with Z-order Grid File, R-tree,... Linear clustering by Z-order Cell-tree page 22

Clustering methods - CCAM Data page Graph partitioning B+-tree index page 23

Clustering methods - Z-order Nodes in different partitions are Z-order (0, 1, 3, 4) (5, 6, 7, 8) (9, 11, 12, 13) (14, 15, 18 26) page 24

Clustering methods - cell-tree Nodes in different partitions are BSP partitioning (0, 1, 3, 6) (4, 5, 7, 18) (8, 9, 11, 14) (12, 13, 15, 26) page 25

Buffering methods Of course, buffers matter too FIFO MRU LRU Buffer size (page size) 1k, 4k, 8k,... Effect to model building Effect to ROD algorithm page 26

Outliers detected White vertical line (14.45) temporal outlier White square (8.20-10.00) spatial-temporal outlier Station 9 inconsistent traffic flow page 27