Data Organization and Processing

Spatial Join (NDBI007)
David Hoksza
http://siret.ms.mff.cuni.cz/hoksza

Outline
- Spatial join basics
- Relational join
- Spatial join

Spatial join definition (1)
- Given two sets of multi-dimensional objects in Euclidean space, a spatial join finds all pairs of objects that satisfy a given spatial relation, such as intersection.
- Simplified spatial join: given two sets of rectangles, R and S, find all pairs of intersecting rectangles between the two sets, i.e. $\{(r, s) : r \cap s \neq \emptyset,\ r \in R,\ s \in S\}$

Spatial join definition (2)
Spatial overlay join (general spatial join):
- the data sets can consist of general spatial objects (points, lines, polygons)
- the data sets can have more than two dimensions
- the relation between pairs of spatial objects can be any spatial relation (nearness, enclosure, direction, ...)
- more than two sets can take part in the relation (multiway spatial join), or one set can be joined with itself (self spatial join)

Spatial join examples
- Find all pairs of rivers and cities that intersect.
- Find all of the rural areas that are below sea level, given an elevation map and a land use map.
- Find the houses inside the areas with poor slope stability.

Non-spatial join
- Algorithms exist for the standard relational join; only equi-joins (the join predicate is equality) are considered here
  - nested loop join
  - sort-merge join
  - hash join
- Most of the standard relational join algorithms are not suitable for spatial data because the join condition involves a multidimensional spatial attribute

Non-spatial nested loop join
- Nested loop join checks, for each element of relation R, all elements of relation S one by one

    FOREACH r ∈ R DO
        FOREACH s ∈ S DO
            IF cond(r, s) THEN REPORT(r, s)

- In its basic version, the nested loop join is the least efficient of the relational join algorithms
- It is the only standard relational join algorithm that is also applicable to spatial data as is
- Therefore, special spatial join algorithms had to be developed

Non-spatial sort-merge join
- Two-phase algorithm
  - sort both relations R and S independently
  - scan both relations at once in the same order and join
- [Figure: two sorted key lists being merged: 01 02 02 04 05 05 and 02 02 02 03 04 04 05]
- No linear ordering of multi-dimensional data preserves spatial proximity, so this method (used as is) is not applicable to spatial data
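The two phases can be sketched in Python as follows. This is a minimal illustration on plain integer keys, not code from the lecture; the split of the example keys into two lists is an assumption based on the figure residue above.

```python
def sort_merge_join(r_keys, s_keys):
    """Equi-join of two key lists: sort both, then merge them in one pass."""
    r, s = sorted(r_keys), sorted(s_keys)      # phase 1: sort independently
    i = j = 0
    out = []
    while i < len(r) and j < len(s):           # phase 2: synchronized scan
        if r[i] < s[j]:
            i += 1
        elif r[i] > s[j]:
            j += 1
        else:
            key = r[i]
            # Find the end of the run of equal keys in each list.
            i_end = i
            while i_end < len(r) and r[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end] == key:
                j_end += 1
            # Every element of the R run joins every element of the S run.
            out.extend((key, key) for _ in range((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out

# Assumed split of the example keys into the two relations.
R = [1, 2, 2, 4, 5, 5]
S = [2, 2, 2, 3, 4, 4, 5]
pairs = sort_merge_join(R, S)
```

The merge relies on a total order in which equal keys are adjacent; rectangles have no such order, which is why the method does not carry over directly.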

Non-spatial hash join
- One of the relations (R) is hashed with a hash function h (assuming that R fits into main memory)
- The other relation (S) is processed element by element, and each element's join attribute is hashed with h
- If h(r) = h(s) for two elements r ∈ R, s ∈ S, then r and s are checked for r.cmpr_attributes = s.cmpr_attributes
- Equi-joins rely on grouping objects with the same value, which is not possible for spatial objects because they have an extent
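A minimal Python sketch of the build and probe phases is given below. The row layout (tuples whose first field is the join attribute) and the function name `hash_join` are assumptions made for the illustration.

```python
from collections import defaultdict

def hash_join(r_rows, s_rows, key):
    """Equi-join: build a hash table on R, then probe it with each row of S."""
    buckets = defaultdict(list)
    for r in r_rows:                       # build phase (R assumed to fit in memory)
        buckets[key(r)].append(r)
    result = []
    for s in s_rows:                       # probe phase
        # A dict lookup both hashes the key and checks it for equality.
        for r in buckets.get(key(s), []):
            result.append((r, s))
    return result

R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (3, "y")]
pairs = hash_join(R, S, key=lambda row: row[0])
```

A rectangle has no single value to hash on: two intersecting rectangles generally have different coordinates, so they would not land in the same bucket.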

Filter-refine policy
- The objects to be searched for can be very complex
  - testing the join condition (spatial predicate) itself can be highly time-consuming
  - not many objects fit into main memory
- Filter-refine strategy: spatial objects are approximated using simple spatial objects
  - filter: the spatial join is conducted using the objects' approximations, which yields a candidate set
  - refine: pairs which pass the filter are tested for the spatial predicate using their full spatial representation
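The two stages can be illustrated with a small Python sketch. Circles stand in for "complex" objects here purely because their exact intersection test is short; the representation `(x, y, radius)` and all function names are assumptions of this example.

```python
from itertools import product
from math import hypot

def mbr(circle):
    """Axis-parallel bounding rectangle (xmin, ymin, xmax, ymax) of a circle."""
    x, y, r = circle
    return (x - r, y - r, x + r, y + r)

def mbrs_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circles_intersect(c1, c2):
    """Exact predicate: centers are no farther apart than the sum of the radii."""
    return hypot(c1[0] - c2[0], c1[1] - c2[1]) <= c1[2] + c2[2]

def filter_refine_join(set_a, set_b):
    # Filter: cheap test on approximations produces the candidate set.
    candidates = [(a, b) for a, b in product(set_a, set_b)
                  if mbrs_intersect(mbr(a), mbr(b))]
    # Refine: exact test on the full representation removes false hits.
    results = [(a, b) for a, b in candidates if circles_intersect(a, b)]
    return candidates, results

A = [(0.0, 0.0, 1.0)]
B = [(1.8, 1.8, 1.0), (1.5, 0.0, 1.0), (5.0, 5.0, 1.0)]
candidates, results = filter_refine_join(A, B)
```

In this data the first circle of B passes the filter but fails the refine step (a false hit), the second is a true hit, and the third is rejected by the filter alone.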

Approximations (1)
- The most common approximation is the minimum bounding rectangle (MBR): the smallest rectangle that fully encloses the given object and whose sides are parallel to the axes
- Dead space: the amount of space covered by the approximation but not by the approximated object
  - the approximation should aim to minimize dead space
  - large dead space areas can lead to a higher false hit rate in the filtering stage
- [Figure: MBR with a large dead area and a resulting false hit]
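As a small worked example, the following Python sketch computes the MBR of a simple polygon and its dead space. The vertex-list representation and the helper names are assumptions of the sketch; the polygon area uses the standard shoelace formula.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a point list."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def dead_space(points):
    """Area covered by the MBR but not by the polygon itself."""
    xmin, ymin, xmax, ymax = mbr(points)
    return (xmax - xmin) * (ymax - ymin) - polygon_area(points)

triangle = [(0, 0), (4, 0), (0, 3)]
box = mbr(triangle)
wasted = dead_space(triangle)
```

For this right triangle the MBR has area 12 and the triangle area 6, so half of the approximation is dead space.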

Approximations (2)
- Conservative: every point of the object is also a point of the approximation
  - MBR (most common)
  - minimum bounding polygon
  - minimum bounding circle/ellipse
- Progressive: every point of the approximation is also a point of the object
  - maximum nested rectangles, circles, ...
- [Figure labels: concave, convex]

Approximations (3)
- Object decomposition minimizes dead space by decomposing an object into disjoint fragments, each with its own MBR
- After the refine step, an extra filtering stage is needed to remove duplicates

Spatial join methods
- Internal memory methods
  - Nested loop join
  - Index nested loop join
  - Plane sweep
  - Z-order
- External memory methods
  - Hierarchical traversal
  - Partitioning-based methods

Internal memory methods
- Nested loop join
- Index nested loop join
- Plane sweep
- Z-order

Nested loop join
- The algorithm is identical to the standard relational one
- Works with arbitrary object types and join conditions
- Given datasets A and B, the join takes O(|A| · |B|) time
- Suitable for small datasets

    NESTED_LOOP_JOIN(setA, setB, joinCondition)
    INPUT: Sets to join and the condition on which the join is based
    OUTPUT: Pairs of objects satisfying the join condition
    FOREACH a ∈ setA
        FOREACH b ∈ setB
            IF Satisfied(a, b, joinCondition) THEN
                REPORT(a, b);
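A direct Python rendering of the pseudocode is sketched below. Rectangles are assumed to be closed and stored as `(xmin, ymin, xmax, ymax)` tuples; that representation and the `intersects` helper are choices of this example.

```python
def intersects(a, b):
    """True if two closed axis-parallel rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def nested_loop_join(set_a, set_b, condition):
    """Test every pair (a, b): O(|A| * |B|) evaluations of the condition."""
    return [(a, b) for a in set_a for b in set_b if condition(a, b)]

A = [(0, 0, 2, 2), (5, 5, 6, 6)]
B = [(1, 1, 3, 3), (1, -1, 4, 1), (7, 7, 8, 8)]
pairs = nested_loop_join(A, B, intersects)
```

Because the condition is passed in as a function, the same loop works for any object type and any spatial predicate.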

Index nested loop join
- Variant of the nested loop join in which a spatial index is first created over one of the sets
- The indexed set is queried for intersection (window query) by each of the objects from the second set
- Given datasets A and B, with the index built over A, the join takes O(|B| · log |A|) time (not including the time needed to build the index)

    INDEX_NESTED_LOOP_JOIN(setA, setB)
    INPUT: Sets to join
    OUTPUT: Pairs of intersecting objects
    ix := CreateSpatialIndex(setA)
    FOREACH b ∈ setB
        REPORT(ix.Search(b));
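The structure of the algorithm can be sketched in Python as follows. Note the substitution: instead of a tree-based spatial index such as an R-tree, this example uses a simple uniform grid as the index, so the logarithmic query bound quoted above does not apply to it. The class `GridIndex`, the cell size, and the rectangle layout `(xmin, ymin, xmax, ymax)` are all assumptions of the sketch.

```python
from collections import defaultdict

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class GridIndex:
    """Toy spatial index: each rectangle is registered in every grid cell it touches."""

    def __init__(self, rects, cell=4.0):
        self.cell = cell
        self.rects = list(rects)
        self.cells = defaultdict(list)
        for i, r in enumerate(self.rects):
            for key in self._cells(r):
                self.cells[key].append(i)

    def _cells(self, r):
        c = self.cell
        for gx in range(int(r[0] // c), int(r[2] // c) + 1):
            for gy in range(int(r[1] // c), int(r[3] // c) + 1):
                yield (gx, gy)

    def search(self, window):
        """Window query: all indexed rectangles intersecting the window."""
        hits = set()                      # a set avoids reporting a rectangle twice
        for key in self._cells(window):
            for i in self.cells.get(key, ()):
                if intersects(self.rects[i], window):
                    hits.add(i)
        return [self.rects[i] for i in sorted(hits)]

def index_nested_loop_join(set_a, set_b):
    ix = GridIndex(set_a)                 # index one set ...
    return [(a, b) for b in set_b for a in ix.search(b)]   # ... probe with the other

A = [(0, 0, 2, 2), (3, 3, 9, 9), (10, 0, 12, 2)]
B = [(1, 1, 4, 4), (11, 1, 13, 3), (20, 20, 21, 21)]
pairs = index_nested_loop_join(A, B)
```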

Plane sweep
- Two-phase algorithm for the identification of intersecting rectangles
  - sort the rectangles in ascending order of their left sides (x-axis)
  - sweep a vertical scan line through the sorted list from left to right, halting at the lower x coordinate of each rectangle; identify the rectangles that the vertical line intersects together with the current rectangle, and check them for intersection on the y-axis
- Similar to the non-spatial sort-merge join

Plane sweep algorithm (1)

    PLANE_SWEEP(setA, setB)
    INPUT: Sets of rectangles to join
    OUTPUT: Pairs of intersecting rectangles
    listA := SortByLeftSide(setA);
    listB := SortByLeftSide(setB);
    sweepStructA := CreateSweepStructure();
    sweepStructB := CreateSweepStructure();
    {First() of an exhausted list is assumed to compare as +infinity}
    WHILE NOT(listA.End()) OR NOT(listB.End()) DO
        IF listA.First() < listB.First() THEN
            sweepStructA.Insert(listA.First());
            sweepStructB.RemoveInactive(listA.First());
            sweepStructB.Search(listA.First());
            listA.Next();
        ELSE
            sweepStructB.Insert(listB.First());
            sweepStructA.RemoveInactive(listB.First());
            sweepStructA.Search(listB.First());
            listB.Next();

Plane sweep algorithm (2)
- The sweep structure tracks active rectangles and has to support three operations
  - Insert: inserts a rectangle into the active set
  - RemoveInactive: removes from the active set all rectangles that do not overlap a given rectangle (line)
  - Search: searches for all active rectangles that intersect a given rectangle and outputs them
- If the data are sorted, only the data in the sweep structures need to be kept in internal memory
- Given datasets A and B, the algorithm takes O(n log n) time, where n = |A| + |B|, including the time to sort the lists
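The sweep can be sketched in Python as below. The sweep structures here are plain lists, so Search is a linear scan rather than the more efficient structure the O(n log n) bound presumes; this and the rectangle layout `(xmin, ymin, xmax, ymax)` are simplifying assumptions of the example.

```python
def overlaps_y(a, b):
    return a[1] <= b[3] and b[1] <= a[3]

def plane_sweep_join(set_a, set_b):
    list_a = sorted(set_a, key=lambda r: r[0])    # sort by left side
    list_b = sorted(set_b, key=lambda r: r[0])
    active_a, active_b = [], []                   # sweep structures
    out = []
    i = j = 0
    while i < len(list_a) or j < len(list_b):
        # Take from A if B is exhausted or A's next left side comes first.
        take_a = j >= len(list_b) or (
            i < len(list_a) and list_a[i][0] < list_b[j][0])
        if take_a:
            cur = list_a[i]
            i += 1
            active_a.append(cur)                                   # Insert
            active_b[:] = [r for r in active_b if r[2] >= cur[0]]  # RemoveInactive
            out.extend((cur, r) for r in active_b if overlaps_y(cur, r))  # Search
        else:
            cur = list_b[j]
            j += 1
            active_b.append(cur)
            active_a[:] = [r for r in active_a if r[2] >= cur[0]]
            out.extend((r, cur) for r in active_a if overlaps_y(r, cur))
    return out

A = [(0, 0, 2, 2), (3, 3, 9, 9), (10, 0, 12, 2)]
B = [(1, 1, 4, 4), (11, 1, 13, 3), (20, 20, 21, 21)]
pairs = plane_sweep_join(A, B)
```

Every rectangle still active in the opposite structure is known to overlap the current one on the x-axis, so only the y-extent has to be checked. Each pair is reported exactly once, at the moment the later of its two rectangles is reached by the sweep line.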

Z-ordering
- Z-order methods represent an object as a set of continuous segments of the Z-order curve: object o → $\{z_{1,o}, \dots, z_{n,o}\}$
- The Z-order-based representation can be processed in different ways
  - sort-merge: the standard relational sort-merge, but the sets are Z-ordered
  - plane sweep: the active set, instead of being the objects that intersect the sweep line, consists of the enclosing cells of the objects that intersect the current Z-order grid cell
    - the sweep structure can be represented by a stack in which each object intersects the query object
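To make the Z-order representation concrete, the following Python sketch computes the Z-value of a grid cell by bit interleaving and lists the Z-values of all unit cells covered by an object. It assumes non-negative integer cell coordinates and does not merge cells into larger segments, which a real implementation would typically do; the function names are hypothetical.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y (x in the even positions, y in the odd ones)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_cells(cell_range):
    """Sorted Z-values of all unit cells in an inclusive range (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = cell_range
    return sorted(z_value(x, y)
                  for x in range(xmin, xmax + 1)
                  for y in range(ymin, ymax + 1))

# The four cells of a 2 x 2 block are visited in the characteristic "Z" shape.
block = [z_value(x, y) for y in range(2) for x in range(2)]

aligned = z_cells((0, 0, 1, 1))      # one contiguous run of Z-values
straddling = z_cells((1, 0, 2, 0))   # two neighbouring cells, far apart on the curve
```

The second object shows why a single object generally needs several Z-values: cells that are adjacent in space can be non-adjacent along the curve.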

External memory methods
- Hierarchical traversal
- Partitioning-based methods

Hierarchical traversal
- Assumption: both datasets are indexed using a hierarchical index such as an R-tree
- Synchronized traversal can be used to test the join condition
  - the algorithm traverses the two trees in a synchronized fashion and compares bounding objects at given levels
  - for indexes of different heights, the join of a leaf of one index with a subtree of the other can be accomplished using a window query, or by handling leaf-to-node comparisons as special cases
  - if a node corresponding to a part of the space does not match the condition, it can be excluded from the traversal
- Iterative filter-and-refine approach
- Reading a node usually corresponds to reading one memory page

General synchronized traversal

    INDEXED_TRAVERSAL_JOIN(rootA, rootB)
    INPUT: Roots of the structures representing the sets to be joined
    OUTPUT: Pairs of intersecting rectangles
    queue := CreateQueue();
    queue.Add(pair(rootA, rootB));
    WHILE NOT(queue.Empty()) DO
        nodePair := queue.Pop();
        pairs := IdentifyIntersectingPairs(nodePair);
        FOREACH p ∈ pairs DO
            IF p is leaf THEN
                ReportIntersection(p);
            ELSE
                queue.Add(p);
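A minimal Python sketch of the synchronized traversal follows. The trees are built by hand from a hypothetical `Node` class rather than by a real R-tree implementation, and a node without children stands for a data rectangle. Pruning is done when a pair is popped rather than before it is queued, which differs slightly from the pseudocode but visits the same intersecting pairs.

```python
from collections import deque
from dataclasses import dataclass, field

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    mbr: tuple                                      # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)    # empty list: data rectangle

def parent(children):
    """Inner node whose MBR encloses the MBRs of all its children."""
    boxes = [c.mbr for c in children]
    return Node((min(b[0] for b in boxes), min(b[1] for b in boxes),
                 max(b[2] for b in boxes), max(b[3] for b in boxes)), children)

def traversal_join(root_a, root_b):
    out = []
    queue = deque([(root_a, root_b)])
    while queue:
        a, b = queue.popleft()
        if not intersects(a.mbr, b.mbr):
            continue                                # prune both subtrees at once
        if not a.children and not b.children:
            out.append((a.mbr, b.mbr))              # two data rectangles intersect
        else:
            # Descend on every side that still has children; a side that is
            # already a data rectangle stays fixed (trees of different heights).
            for ca in (a.children or [a]):
                for cb in (b.children or [b]):
                    queue.append((ca, cb))
    return out

A = [(0, 0, 2, 2), (3, 3, 9, 9), (10, 0, 12, 2)]
B = [(1, 1, 4, 4), (11, 1, 13, 3), (20, 20, 21, 21)]
tree_a = parent([parent([Node(A[0]), Node(A[1])]), parent([Node(A[2])])])
tree_b = parent([Node(r) for r in B])               # one level shallower than tree_a
pairs = traversal_join(tree_a, tree_b)
```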

Partitioning
- Often applied when neither of the sets to be joined is indexed
- The resulting partitions should be small enough to fit in internal memory
- Once the data are partitioned, each pair of overlapping partitions is read into internal memory and internal memory techniques are used

Partition join of uniform data

    GRID_JOIN(setA, setB)
    INPUT: Sets of objects to be joined
    OUTPUT: Pairs of intersecting objects
    m := AvailableInternalMemory();
    mbrSize := BytesToStoreMBR();
    minNrOfPartitions := (setA.Size() + setB.Size()) * mbrSize / m;
    partList := DeterminePartitions(minNrOfPartitions);
    partitionPointersA := PartitionData(partList, setA); {object appears in every partition it intersects}
    partitionPointersB := PartitionData(partList, setB);
    FOREACH part ∈ partList DO
        partitionA := ReadPartition(partitionPointersA, part);
        partitionB := ReadPartition(partitionPointersB, part);
        PLANE_SWEEP(partitionA, partitionB);

Avoiding duplicate results
- Space partitioning can lead to object duplication across partition borders, which can, in turn, lead to duplicate results
- Deduplication
  - sort and remove duplicates: requires sorting, which increases the computational cost
  - reference point method: a consistently chosen reference point is selected from the intersection region, and the intersection is reported only if the reference point lies within the given partition
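The grid partitioning and the reference point method can be sketched together in Python. The sketch works entirely in memory, uses a fixed cell size instead of deriving the partition count from available memory, and joins each partition pair with a nested loop rather than a plane sweep; the lower-left corner of the intersection region is the reference point chosen here. All of these are assumptions of the example.

```python
from collections import defaultdict

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cells_of(r, cell):
    """All grid cells that the rectangle (xmin, ymin, xmax, ymax) touches."""
    for gx in range(int(r[0] // cell), int(r[2] // cell) + 1):
        for gy in range(int(r[1] // cell), int(r[3] // cell) + 1):
            yield (gx, gy)

def partition(rects, cell):
    parts = defaultdict(list)
    for r in rects:
        for key in cells_of(r, cell):      # object appears in every partition it intersects
            parts[key].append(r)
    return parts

def grid_join(set_a, set_b, cell=4.0):
    parts_a, parts_b = partition(set_a, cell), partition(set_b, cell)
    out = []
    for key in parts_a.keys() & parts_b.keys():
        for a in parts_a[key]:
            for b in parts_b[key]:
                if intersects(a, b):
                    # Reference point: lower-left corner of the intersection region.
                    ref = (max(a[0], b[0]), max(a[1], b[1]))
                    # Report only in the one partition that contains the point.
                    if (int(ref[0] // cell), int(ref[1] // cell)) == key:
                        out.append((a, b))
    return out

A = [(3, 3, 5, 5), (0, 0, 1, 1)]
B = [(3.5, 3.5, 6, 6), (0.5, 0.5, 0.8, 0.8), (10, 10, 11, 11)]
pairs = grid_join(A, B)
```

The first rectangles of A and B both span four partitions, so without the reference point check their intersection would be reported four times.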

Grid partitioning: skewed distributions
- The basic grid algorithm is rarely used because the distribution of objects is often not uniform
- [Patel & DeWitt; 1996] proposed grouping partitions with a mapping function to minimize skew, creating partitions with a similar number of items

Size-based partitioning [Koudas & Sevcik; 1997]
- Partitioning of data using a series of finer and finer grids
  - the level i grid contains $4^{i-1}$ partitions ($2^{i-1} - 1$ horizontal and $2^{i-1} - 1$ vertical splitting lines)
  - each object is placed into the partition of the finest grid whose grid lines the object does not intersect
    - the second-level grid holds objects which cross level 3 partition borders (16 cells) but not level 2 partition borders (4 cells)
  - each level acts as a filter: an object falls through successively finer grids and stops at the last level before a partition boundary is crossed
- [Figure: example objects associated with level 1, with level 2, and with levels > 2]
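The level assignment can be sketched in Python as follows. The sketch assumes a square data space starting at the origin, a fixed maximum level, and rectangles stored as `(xmin, ymin, xmax, ymax)`; a rectangle that merely touches a grid line is treated as crossing it. The function name `level_of` is hypothetical.

```python
def level_of(rect, extent=8.0, max_level=4):
    """Finest level whose grid lines the rectangle does not cross.

    The level i grid has 2**(i - 1) cells per side, i.e. 4**(i - 1) partitions.
    """
    level = 1                                   # level 1: the whole space, no grid lines
    for i in range(2, max_level + 1):
        size = extent / 2 ** (i - 1)            # cell side length at level i
        same_column = int(rect[0] // size) == int(rect[2] // size)
        same_row = int(rect[1] // size) == int(rect[3] // size)
        if not (same_column and same_row):
            break                               # a boundary is crossed: stop falling
        level = i
    return level

levels = [level_of(r) for r in [(3, 3, 5, 5),            # crosses the level 2 lines
                                (1, 1, 3, 3),            # crosses only level 3 lines
                                (0.2, 0.2, 0.8, 0.8)]]   # fits in one finest cell
```

Because the grids are nested, an object that crosses a line at one level also crosses it at every finer level, so the loop can stop at the first crossing.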

Searching in size-based partitioning
- Given two spatial data sets A and B
- Scan data sets A and B and for each entity:
  1. Compute the Hilbert (Z) value of the entity
  2. Determine the level to which the entity belongs and place its entity descriptor in the corresponding level file
- For each level file:
  1. Sort by Hilbert (Z) value
- Perform a synchronized scan over the pages of the level files
  - each page of each level file is read just once
  - entries in page $A_l$ (l-th level) need to be checked only against the current pages $B_m$ ($m \le l$)