Spatial Data Management

Similar documents
Spatial Data Management

Principles of Data Management. Lecture #14 (Spatial Data Management)

Multidimensional Indexing The R Tree

What we have covered?

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Tree-Structured Indexes

Advanced Data Types and New Applications

Datenbanksysteme II: Multidimensional Index Structures 2. Ulf Leser

Solid Modeling. Thomas Funkhouser Princeton University C0S 426, Fall Represent solid interiors of objects

Advanced Data Types and New Applications

CSE 530A. B+ Trees. Washington University Fall 2013

Multidimensional Indexes [14]

Query Processing and Advanced Queries. Advanced Queries (2): R-TreeR

Multidimensional Data and Modelling - DBMS

Tree-Structured Indexes

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Course Content. Objectives of Lecture? CMPUT 391: Spatial Data Management Dr. Jörg Sander & Dr. Osmar R. Zaïane. University of Alberta

Tree-Structured Indexes. Chapter 10

R-Trees. Accessing Spatial Data

AN OVERVIEW OF SPATIAL INDEXING WITHIN RDBMS

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

layers in a raster model

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Introduction to Spatial Database Systems

Geometric Algorithms. Geometric search: overview. 1D Range Search. 1D Range Search Implementations

CMPUT 391 Database Management Systems. Spatial Data Management. University of Alberta 1. Dr. Jörg Sander, 2006 CMPUT 391 Database Management Systems

So, we want to perform the following query:

Tree-Structured Indexes

CPSC / Sonny Chan - University of Calgary. Collision Detection II

Chapter 25: Spatial and Temporal Data and Mobility

An Introduction to Spatial Databases

Roadmap DB Sys. Design & Impl. Reference. Detailed roadmap. Spatial Access Methods - problem. z-ordering - Detailed outline.

I/O-Algorithms Lars Arge Aarhus University

Chapter 11: Indexing and Hashing

Summary. 4. Indexes. 4.0 Indexes. 4.1 Tree Based Indexes. 4.0 Indexes. 19-Nov-10. Last week: This week:

External-Memory Algorithms with Applications in GIS - (L. Arge) Enylton Machado Roberto Beauclair

Data Organization and Processing

Tree-Structured Indexes

Multidimensional (spatial) Data and Modelling (2)

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

Tree-Structured Indexes

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Introduction. Choice orthogonal to indexing technique used to locate entries K.

amiri advanced databases '05

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CSE 512 Course Project Operation Requirements

Spatial Data Structures

Tree-Structured Indexes

CMSC 754 Computational Geometry 1

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Multimedia Database Systems

Spatial Data Structures

Data Warehousing & Data Mining

Indexing the Positions of Continuously Moving Objects

Chapter 12: Indexing and Hashing. Basic Concepts

Geometric data structures:

Chap4: Spatial Storage and Indexing

Intro to DB CHAPTER 12 INDEXING & HASHING

Introduction to Spatial Database Systems. Outline

Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages!

Chapter 12: Indexing and Hashing

LSGI 521: Principles of GIS. Lecture 5: Spatial Data Management in GIS. Dr. Bo Wu

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Spring 2017 B-TREES (LOOSELY BASED ON THE COW BOOK: CH. 10) 1/29/17 CS 564: Database Management Systems, Jignesh M. Patel 1

Chap4: Spatial Storage and Indexing. 4.1 Storage:Disk and Files 4.2 Spatial Indexing 4.3 Trends 4.4 Summary

I think that I shall never see A billboard lovely as a tree. Perhaps unless the billboards fall I ll never see a tree at all.

Context. File Organizations and Indexing. Cost Model for Analysis. Alternative File Organizations. Some Assumptions in the Analysis.

Chapter 5. Indexing for DWH

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

Overview of Storage and Indexing

Lecture 8 Index (B+-Tree and Hash)

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Autonomous and Mobile Robotics Prof. Giuseppe Oriolo. Motion Planning 1 Retraction and Cell Decomposition

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Indexing and Hashing

Index Tuning. Index. An index is a data structure that supports efficient access to data. Matching records. Condition on attribute value

Computer Graphics. Bing-Yu Chen National Taiwan University The University of Tokyo

Chapter 12: Indexing and Hashing

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Introduction to Data Management. Lecture 21 (Indexing, cont.)

11 - Spatial Data Structures

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Introduction to Data Management. Lecture 15 (More About Indexing)

Chapter 2: The Game Core. (part 2)

Topic 5: Raster and Vector Data Models

CMSC724: Access Methods; Indexes 1 ; GiST

Spatial Data Structures

Spatial Data Structures for Computer Graphics

Tree-Structured Indexes

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

The B-Tree. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

Indexing: B + -Tree. CS 377: Database Systems

Chapter 17 Indexing Structures for Files and Physical Database Design

Evaluation of Relational Operations: Other Techniques

Transcription:

Spatial Data Management [R&G] Chapter 28 CS432 1

Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite imagery, where each pixel stores a measured value E.g., Feature vectors extracted from text Region Data Objects have spatial extent with location and boundary DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data. CS432 2

Types of Spatial Queries Spatial Range Queries Find all cities within 50 miles of Ithaca Query has associated region (location, boundary) Answer includes ovelapping or contained data regions Nearest-Neighbor Queries Find the 10 cities nearest to Ithaca Results must be ordered by proximity Spatial Join Queries Find all cities near a lake Expensive, join condition involves regions and proximity CS432 3

Applications of Spatial Data Geographic Information Systems (GIS) E.g., ESRI s ArcInfo; OpenGIS Consortium Geospatial information All classes of spatial queries and data are common Computer-Aided Design/Manufacturing Store spatial objects such as surface of airplane fuselage Range queries and spatial join queries are common Multimedia Databases Images, video, text, etc. stored and retrieved by content First converted to feature vector form; high dimensionality Nearest-neighbor queries are the most common CS432 4

Single-Dimensional Indexes B+ trees are fundamentally single-dimensional indexes. When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space since we sort entries first by age and then by sal. Consider entries: <11, 80>, <12, 10> <12, 20>, <13, 75> 11 12 13 CS432 5 80 70 60 50 40 30 20 10 B+ tree order

Multidimensional Indexes A multidimensional index clusters entries so as to exploit nearness in multidimensional space. Keeping track of entries and maintaining a balanced index structure presents a challenge! Consider entries: <11, 80>, <12, 10> <12, 20>, <13, 75> 80 70 60 50 40 30 20 10 11 12 13 Spatial clusters B+ tree order CS432 6

Motivation for Multidimensional Indexes Spatial queries (GIS, CAD). Find all hotels within a radius of 5 miles from the conference venue. Find the city with population 500,000 or more that is nearest to Kalamazoo, MI. Find all cities that lie on the Nile in Egypt. Find all parts that touch the fuselage (in a plane design). Similarity queries (content-based retrieval). Given a face, find the five most similar faces. Multidimensional range queries. 50 < age < 55 AND 80K < sal < 90K CS432 7

What s the difficulty? An index based on spatial location needed. One-dimensional indexes don t support multidimensional searching efficiently. (Why?) Hash indexes only support point queries; want to support range queries as well. Must support inserts and deletes gracefully. Ideally, want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today. CS432 8

The R-Tree The R-tree is a tree-structured index that remains balanced on inserts and deletes. Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension. Example in 2-D: Root of R Tree CS432 9 Y X Leaf level

R-Tree Properties Leaf entry = < n-dimensional box, rid > This is Alternative (2), with key value being a box. Box is the tightest bounding box for a data object. Non-leaf entry = < n-dim box, ptr to child node > Box covers all boxes in child node (in fact, subtree). All leaves at same distance from root. Nodes can be kept 50% full (except root). Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full. CS432 10

Example of an R-Tree Leaf entry R1 R3 R8 R4 R9 R10 R11 R5 R14 R13 Index entry Spatial object approximated by bounding box R8 R12 R7 R18 R6 R16 R17 R19 R15 R2 CS432 11

Example R-Tree (Contd.) R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 CS432 12

Search for Objects Overlapping Box Q Start at root. 1. If current node is non-leaf, for each entry <E, ptr>, if box E overlaps Q, search subtree identified by ptr. 2. If current node is leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q. Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.) CS432 13

Improving Search Using Constraints It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly. But why not use convex polygons to approximate query regions more accurately? Will reduce overlap with nodes in tree, and reduce the number of nodes fetched by avoiding some branches altogether. Cost of overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win. CS432 14

Insert Entry <B, ptr> Start at root and go down to best-fit leaf L. Go to child whose box needs least enlargement to cover B; resolve ties by going to smallest area child. If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2. Adjust entry for L in its parent so that the box now covers (only) L1. Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.) CS432 15

Splitting a Node During Insertion The entries in node L plus the newly inserted entry must be distributed between L1 and L2. Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries. Idea: Redistribute so as to minimize area of L1 plus area of L2. Exhaustive algorithm is too slow; quadratic and linear heuristics are described in the paper. GOOD SPLIT! BAD! CS432 16

R-Tree Variants The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting: Remove some (say, 30% of the) entries and reinsert them into the tree. Could result in all reinserted entries fitting on some existing pages, avoiding a split. R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion. Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary. Searches now take a single path to a leaf, at cost of redundancy. CS432 17

GiST The Generalized Search Tree (GiST) abstracts the tree nature of a class of indexes including B+ trees and R-tree variants. Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide templates for these algorithms that can be customized to obtain the many different tree index structures. B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs. GiST provides an alternative for implementing other tree indexes in an ORDBS. CS432 18

Indexing High-Dimensional Data Typically, high-dimensional datasets are collections of points, not regions. E.g., Feature vectors in multimedia applications. Very sparse Nearest neighbor queries are common. R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions. As dimensionality increases contrast (ratio of distances between nearest and farthest points) usually decreases; nearest neighbor is not meaningful. In any given data set, advisable to empirically test contrast. CS432 19

Summary Spatial data management has many applications, including GIS, CAD/CAM, multimedia indexing. Point and region data Overlap/containment and nearest-neighbor queries Many approaches to indexing spatial data R-tree approach is widely used in GIS systems Other approaches include Grid Files, Quad trees, and techniques based on space-filling curves. For high-dimensional datasets, unless data has good contrast, nearest-neighbor may not be wellseparated CS432 20

Comments on R-Trees Deletion consists of searching for the entry to be deleted, removing it, and if the node becomes under-full, deleting the node and then re-inserting the remaining entries. Overall, works quite well for 2 and 3 D datasets. Several variants (notably, R+ and R* trees) have been proposed; widely used. Can improve search performance by using a convex polygon to approximate query shape (instead of a bounding box) and testing for polygon-box intersection. CS432 21