Database Management Systems. Objectives of Lecture 10. The Need for a DBMS

Similar documents
Course Content. Objectives of Lecture? CMPUT 391: Spatial Data Management Dr. Jörg Sander & Dr. Osmar R. Zaïane. University of Alberta

CMPUT 391 Database Management Systems. Spatial Data Management. University of Alberta 1. Dr. Jörg Sander, 2006 CMPUT 391 Database Management Systems

Course Content. Object-Oriented Databases. Objectives of Lecture 6. CMPUT 391: Object Oriented Databases. Dr. Osmar R. Zaïane. University of Alberta 4

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

COURSE 11. Object-Oriented Databases Object Relational Databases

Spatial Data Management

Spatial Data Management

Multidimensional Data and Modelling - DBMS

Data Organization and Processing

Introduction to Spatial Database Systems

Principles of Data Management. Lecture #14 (Spatial Data Management)

R-Trees. Accessing Spatial Data

Multidimensional Indexing The R Tree

An Introduction to Spatial Databases

What we have covered?

Multidimensional (spatial) Data and Modelling (2)

Chapter 2: The Game Core. (part 2)

CMSC 754 Computational Geometry 1

Summary. 4. Indexes. 4.0 Indexes. 4.1 Tree Based Indexes. 4.0 Indexes. 19-Nov-10. Last week: This week:

Advanced Data Types and New Applications

Query Processing and Advanced Queries. Advanced Queries (2): R-TreeR

Multidimensional Indexes [14]

Data Warehousing & Data Mining

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Spatial Queries. Nearest Neighbor Queries

Spatial Data Structures

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Introduction to Spatial Database Systems. Outline

Spatial Data Structures

Chap4: Spatial Storage and Indexing. 4.1 Storage:Disk and Files 4.2 Spatial Indexing 4.3 Trends 4.4 Summary

Advanced Data Types and New Applications

Computational Geometry

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Background: disk access vs. main memory access (1/2)

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Geometric Structures 2. Quadtree, k-d stromy

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

I/O-Algorithms Lars Arge Aarhus University

X-tree. Daniel Keim a, Benjamin Bustos b, Stefan Berchtold c, and Hans-Peter Kriegel d. SYNONYMS Extended node tree

Computational Geometry

kd-trees Idea: Each level of the tree compares against 1 dimension. Let s us have only two children at each node (instead of 2 d )

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University

Lecture 3 February 9, 2010

{ K 0 (P ); ; K k 1 (P )): k keys of P ( COORD(P) ) { DISC(P): discriminator of P { HISON(P), LOSON(P) Kd-Tree Multidimensional binary search tree Tak

CSIT5300: Advanced Database Systems

Geometric data structures:

Data Mining and Machine Learning: Techniques and Algorithms

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

GEOMETRIC SEARCHING PART 1: POINT LOCATION

Spatial Data Structures

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

So, we want to perform the following query:

Spatial Data Structures

GITA 338: Spatial Information Processing Systems

CSE 530A. B+ Trees. Washington University Fall 2013

Chapter 1: Introduction to Spatial Databases

Spatial Data Structures and Speed-Up Techniques. Tomas Akenine-Möller Department of Computer Engineering Chalmers University of Technology

Chapter 25: Spatial and Temporal Data and Mobility

External Memory Algorithms and Data Structures Fall Project 3 A GIS system

LSGI 521: Principles of GIS. Lecture 5: Spatial Data Management in GIS. Dr. Bo Wu

Datenbanksysteme II: Multidimensional Index Structures 2. Ulf Leser

Balanced Box-Decomposition trees for Approximate nearest-neighbor. Manos Thanos (MPLA) Ioannis Emiris (Dept Informatics) Computational Geometry

Lecturer 2: Spatial Concepts and Data Models

Physical Level of Databases: B+-Trees

Computational Geometry

Prof. Gill Barequet. Center for Graphics and Geometric Computing, Technion. Dept. of Computer Science The Technion Haifa, Israel

15.4 Longest common subsequence

Accelerating Geometric Queries. Computer Graphics CMU /15-662, Fall 2016

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Announcements. Data Sources a list of data files and their sources, an example of what I am looking for:

Advanced 3D-Data Structures

CSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators)

Nearest Neighbor Search by Branch and Bound

CS210 Project 5 (Kd-Trees) Swami Iyer

File Structures and Indexing

Algorithms for GIS:! Quadtrees

Material You Need to Know

Spatial Data Structures

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Indexing: B + -Tree. CS 377: Database Systems

Storing And Using Multi-Scale Topological Data Efficiently In A Client-Server DBMS Environment

Chapter 11: Indexing and Hashing

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Outline. Other Use of Triangle Inequality Algorithms for Nearest Neighbor Search: Lecture 2. Orchard s Algorithm. Chapter VI

15.4 Longest common subsequence

Linked Structures Songs, Games, Movies Part III. Fall 2013 Carola Wenk

Data Mining Classification: Alternative Techniques. Lecture Notes for Chapter 4. Instance-Based Learning. Introduction to Data Mining, 2 nd Edition

Parallel Physically Based Path-tracing and Shading Part 3 of 2. CIS565 Fall 2012 University of Pennsylvania by Yining Karl Li

External-Memory Algorithms with Applications in GIS - (L. Arge) Enylton Machado Roberto Beauclair

Module 4: Tree-Structured Indexing

Raster Data. James Frew ESM 263 Winter

Implementation Techniques

Data Structures and Algorithms

Chapter 3. Image Processing Methods. (c) 2008 Prof. Dr. Michael M. Richter, Universität Kaiserslautern

DDS Dynamic Search Trees

List of Practical for Master in Computer Application (5 Year Integrated) (Through Distance Education)

Spatial Analysis (Vector) II

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Matthew Heric, Geoscientist and Kevin D. Potter, Product Manager. Autometric, Incorporated 5301 Shawnee Road Alexandria, Virginia USA

Transcription:

Database Management Systems Winter 2004 CMPUT 391: Spatial Data Management Dr. Osmar. Zaïane University of Alberta Objectives of Lecture 10 Spatial Data Management Discuss limitations of the relational data model and briefly introduce the Extended- elational Model. This lecture will give you a basic understanding of spatial data management What is special about spatial data What are spatial queries How do typical spatial index structures work CMPUT 391 Database Management Systems University of Alberta 1 CMPUT 391 Database Management Systems University of Alberta 2 Spatial Data Management Shortcomings of elational Databases Modeling Spatial Data Spatial Queries Space-Filling Curves + B-Trees -trees The Need for a DBMS On one hand we have a tremendous increase in the amount of data applications have to handle, on the other hand we want a reduced application development time. Object-Oriented programming DBMS features: query capability with optimization, concurrency control, recovery, indexing, etc. Can we merge these two to get an object database management system since data is getting more complex? CMPUT 391 Database Management Systems University of Alberta 3 CMPUT 391 Database Management Systems University of Alberta 4

Manipulating New Kinds of Data A television channel needs to store video sequences, radio interviews, multimedia documents, geographical information, etc., and retrieve them efficiently. A movie producing company needs to store movies, frame sequences, data about actors and theaters, etc. (textbook example) A biological lab needs to store complex data about molecules, chromosomes, etc, and retrieve parts of data as well as complete data. Think about NHL data and commercial needs. What are the Needs? Images Video Multimedia in general Spatial data (GIS) Biological data CAD data Virtual Worlds Games List of lists User defined data types CMPUT 391 Database Management Systems University of Alberta 5 CMPUT 391 Database Management Systems University of Alberta 6 Shortcomings with DBMS Supports only a small fixed collection of relatively simple data types (integers, floating point numbers, date, strings) No set-valued attributes (sets, lists, ) No inheritance in the Is-a relationship No complex objects, apart from BLOB (binary large object) and CLOB (character large object) Impedance mismatch between data access language (declarative SQL) and host language (procedural C or Java): programmer must explicitly tell how things to be done. Is there a different solution? Existing Object Databases Object database is a persistent storage manager for objects: Persistent storage for object-oriented programming languages (C ++, SmallTalk,etc.) Object-Database Systems: Object-Oriented Database Systems: alternative to relational systems Object-elational Database Systems: Extension to relational systems Market: DBMS ( $8 billion), OODMS ($30 million) world-wide OODB Commercial Products: ObjectStore, GemStone, Orion, etc. CMPUT 391 Database Management Systems University of Alberta 7 CMPUT 391 Database Management Systems University of Alberta 8

DBMS Classification Matrix Object-elational Features of Oracle Methods Query No Query elational DBMS File System Simple Data Object-elational DBMS Object-Oriented DBMS Complex Data CEATE TPE ectangle_typ AS OBJECT ( len NUMBE, wid NUMBE, MEMBE FUNCTION area ETUN NUMBE, ); CEATE TPE BOD ectangle_typ AS MEMBE FUNCTION area ETUN NUMBE IS BEGIN ETUN len * wid; END area; END; CMPUT 391 Database Management Systems University of Alberta 9 CMPUT 391 Database Management Systems University of Alberta 10 Object-elational Features of Oracle Collection types / nested tables CEATE TPE PointType AS OBJECT ( x NUMBE, y NUMBE); CEATE TPE PolygonType AS TABLE OF PointType; CEATE TABLE Polygons ( name VACHA2(20), points PolygonType) NESTED TABLE points STOE AS PointsTable; The relations representing individual polygons are not stored directly as values of the points attribute; they are stored in a single table, PointsTable Spatial Data Management Shortcomings of elational Databases Modeling Spatial Data Spatial Queries Space-Filling Curves + B-Trees -trees CMPUT 391 Database Management Systems University of Alberta 11 CMPUT 391 Database Management Systems University of Alberta 12

elational epresentation of Spatial Data Example: epresentation of geometric objects (here: parcels/fields of land) in normalized relations Parcels Borders Points FNr BNr F 1 F 2 F 3 F 4 F 6 F 5 F 7 F1 F1 F1 F1 F4 F4 F4 F4 F4 F4 F7 F7 F7 F7 B1 B2 B3 B4 B2 B5 B6 B7 B8 B9 B7 B10 B11 B12 BNr PNr 1 PNr 2 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 P1 P2 P3 P4 P2 P5 P6 P7 P8 P6 P9 P10 P2 P3 P4 P1 P5 P6 P7 P8 P3 P9 P10 P7 PNr P 1 P 2 P 3 P 4 P 5 P 6 P 7 P 8 P 9 P 10 -Coord -Coord P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 edundancy free representation requires distribution of the information over 3 tables: Parcels, Borders, Points P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 elational epresentation of Spatial Data For (spatial) queries involving parcels it is necessary to reconstruct the spatial information from the different tables E.g.: if we want to determine if a given point P is inside parcel F 2, we have to find all corner-points of parcel F 2 first SELECT Points.PNr, -Coord, -Coord FOM Parcels, Border, Points WHEE FNr = F 2 AND Parcel.BNr = Borders.BNr AND (Borders.PNr 1 = Points.PNr O Borders.PNr 2 = Points.PNr) Even this simple query requires expensive joins of three tables Querying the geometry (e.g., P in F 2?) is not directly supported. CMPUT 391 Database Management Systems University of Alberta 13 CMPUT 391 Database Management Systems University of Alberta 14 Extension of the elational Model to Support Spatial Data Integration of spatial data types and operations into the core of a DBMS ( object-oriented and object-relational databases) Data types such as Point, Line, Polygon Operations such as ObjectIntersect, angequery, etc. Advantages Natural extension of the relational model and query languages Facilitates design and querying of spatial databases Spatial data types and operations can be supported by spatial index structures and efficient algorithms, implemented in the core of a DBMS All major database vendors today implement support for spatial data and operations in their database systems via object-relational extensions Extension of the elational Model to Support Spatial Data Example elation: ForestZones(Zone: Polygon, ForestOfficial: String, Area: Cardinal) 2 4 6 3 1 5 ForestZones Zone ForestOfficial Area (m 2 ) The province decides that a reforestation is necessary in an area described by a polygon S. Find all forest officials affected by this decision. SELECT ForestOfficial FOM ForestZones WHEE ObjectIntersects (S, Zone) 1 2 3 4 5 6 Stevens Behrens Lee Goebel Jones Kent 3900 4250 6700 5400 1900 4600 CMPUT 391 Database Management Systems University of Alberta 15 CMPUT 391 Database Management Systems University of Alberta 16

Data Types for Spatial Objects Spatial objects are described by Spatial Extent location and/or boundary with respect to a reference point in a coordinate system, which is at least 2-dimensional. Basic object types: Point, Lines, Polygon Other Non-Spatial Attributes Thematic attributes such as height, area, name, land-use, etc. 2-dim. points 2-dim. lines 2-dim. polygons Crop Forest Spatial Data Management Shortcomings of elational Databases Modeling Spatial Data Spatial Queries Space-Filling Curves + B-Trees -trees Water CMPUT 391 Database Management Systems University of Alberta 17 CMPUT 391 Database Management Systems University of Alberta 18 Spatial Query Processing DBMS has to support two types of operations Operations to retrieve certain subsets of spatial object from the database Spatial Queries/Selections, e.g., window query, point query, etc. Operations that perform basic geometric computations and tests E.g., point in polygon test, intersection of two polygons etc. Spatial selections, e.g. in geographic information systems, are often supported by an interactive graphical user interface Point Query P Window Query W Basic Spatial Queries Containment Query: Given a spatial object, find all objects that completely contain. If is a Point: Point Query egion Query: Given a region (polygon or circle), find all spatial objects that intersect with. If is a rectangle: Window Query Enclosure Query: Given a polygon region, find all objects that are completely contained in K-Nearest Neighbor Query: Given an object P, find the k objects that are closest to P (typically for points) egion Query Containment Query Enclosure Query P Point Query Window Query P 2-nn Query CMPUT 391 Database Management Systems University of Alberta 19 CMPUT 391 Database Management Systems University of Alberta 20

Basic Spatial Queries Spatial Join Given two sets of spatial objects (typically minimum bounding rectangles) S 1 = { 1, 2,, m } and S 2 = { 1, 2,, n } Spatial Join: Compute all pairs of objects (, ) such that S 1, S 2, and intersects ( ) Spatial predicates other than intersection are also possible, e.g. all pairs of objects that are within a certain distance from each other {,, A6} {B1,, B3} A5 B2 A6 B1 A2 Spatial-Join A4 B3 Answer Set (A5, B1) (A4, B1) (, B2) (A6, B2) (A2, B3) Index Support for Spatial Queries Conventional index structures such as B-trees are not designed to support spatial queries Group objects only along one dimension Do not preserve spatial proximity E.g. nearest neighbor query: Nearest neighbor of Q is typically not the nearest neighbor in any single dimension C D NN(Q) A B A and B closer in the dimension; C and D closer in the dimension. Q CMPUT 391 Database Management Systems University of Alberta 21 CMPUT 391 Database Management Systems University of Alberta 22 Index Support for Spatial Queries Spatial index structures try to preserve spatial proximity Group objects that are close to each other on the same data page Problem: the number of bytes to store extended spatial objects (lines, polygons) varies Solution: Store Approximations of spatial objects in the index structure, typically axis-parallel minimum bounding rectangles (MB) Exact object representation (E) stored separately; pointers to E in the index E MB... (MB,,...) (MB,,...)... (E) Spatial Index Query Processing Using Approximations Two-Step Procedure 1. Filter Step: Use the index to find all approximations that satisfy the query Some objects already satisfy the query based on the approximation, others have to be checked in the refinement step Candidate Set 2. efinement Step: Load the exact object representations for candidates left after the filter step and test whether they satisfies the query a c f d b e g Query query-window Window Why? a und sind sicher Antworten - a and b are certain answers f, d und sind sicher keine Antworten - f, d, and g are certainly c und e sind Kandidaten not Query c ist answers ein Fehltreffer (false hit, d. h. ein - c and e Kandidat, are candidates der keine Antwort ist) -c is a false hit Filter candidates Not an answer efinement (exact evaluation) false hits final results CMPUT 391 Database Management Systems University of Alberta 23 CMPUT 391 Database Management Systems University of Alberta 24

Spatial Data Management Shortcomings of elational Databases Modeling Spatial Data Spatial Queries Space-Filling Curves + B-Trees -trees Embedding of the 2-dimensional space into a 1 dimensional space Basic Idea: The data space is partitioned into rectangular cells. Use a space filling curve to assign cell numbers to the cells (define a linear order on the cells) The curve should preserve spatial proximity as good as possible Cell numbers should be easy to compute Objects are approximated by cells. Store the cell numbers for objects in a conventional index structure with respect to the linear order 21 20 17 16 5 4 1 0 23 22 19 18 7 6 3 2 29 28 25 24 13 12 9 8 31 30 27 26 15 14 11 10 53 52 49 48 37 36 33 32 55 54 51 50 39 38 35 34 61 60 57 56 45 44 41 40 63 62 59 58 47 46 43 42 CMPUT 391 Database Management Systems University of Alberta 25 CMPUT 391 Database Management Systems University of Alberta 26 Space Filling Curves Z-Order Z-Values Lexicographic Order Z-Order 56 57 58 59 60 61 62 63 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47 32 33 34 35 36 37 38 39 24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 21 23 29 31 53 55 61 63 20 22 28 30 52 54 60 62 17 19 25 27 49 51 57 59 16 18 24 26 48 50 56 58 Hilbert-Curve 63 62 49 48 47 44 43 42 60 61 50 51 46 45 40 41 59 56 55 52 33 34 39 38 58 57 54 53 32 35 36 37 5 6 9 10 31 28 27 26 4 7 8 11 30 29 24 25 3 2 13 12 17 18 23 22 0 1 14 15 16 19 20 21 Coding of Cells Partition the data space recursively into two halves Alternate and dimension Left/bottom 0 ight/top 1 Z-Value: (c, l) c = decimal value of the bit string l = level (number of bits) 1 1 1 0 0 0 0 1 0 1 0 1 0111 7 010000 16 10 2 5 7 13 15 37 39 45 47 4 1 0 6 12 14 36 38 44 46 3 9 11 33 35 41 43 2 8 10 32 34 40 42 if all cells are on the same level, then l can be omitted Z-Order preserves spatial proximity relatively good Z-Order is easy to compute CMPUT 391 Database Management Systems University of Alberta 27 CMPUT 391 Database Management Systems University of Alberta 28

Z-Order epresentation of Spatial Objects For Points Use a fixed a resolution of the space in both dimensions, i.e., each cell has the same size Each point is then approximated by one cell For extended spatial object minimum enclosing cell Problems with cells that intersect the first partitions already improvement: use several cells Better approximation of the objects edundant storage edundant retrieval in spatial queries C C 1 C 2 C4 C 3 Coding of by one cell 21 23 29 31 53 55 61 63 20 22 28 30 52 54 60 62 17 19 25 27 49 51 57 59 16 18 24 26 48 50 56 58 5 7 13 15 37 39 45 47 4 6 12 14 36 38 44 46 1 3 9 11 33 35 41 43 0 2 8 10 32 34 40 42 Coding of by several cells Query Window Query returns the same answer several times Z-Order Mapping to a B + -Tree Linear Order for Z-values to store them in a B + -tree: Let (c 1, l 1 ) and (c 2, l 2 ) be two Z-Values and let l = min{l 1, l 2 }. The order relation Z (that defines a linear order on Z-values) is then defined by (c 1, l 1 ) Z (c 2, l 2 ) iff (c 1 div 2 ) (c 2 div 2 ) Examples: (1,2) Z (3,2), (3,4) Z (3,2), (1,2) Z (10,4) (l 1 - l) (l 2 - l) CMPUT 391 Database Management Systems University of Alberta 29 CMPUT 391 Database Management Systems University of Alberta 30 Mapping to a B + -Tree - Example Mapping to a B + -Tree Window Query (0,2) (2,3) (6,4) (7,4) (7,4) (4,3) (20,5) (6,3) (21,5) (11,4) (0,2) (7,4) (7,4) (6,3) (2,3) (7,4) (4,3) (6,3) (6,4) (7,4) (20,5) (6,3) (6,3) (6,3) (7,3)... Exact representations stored in a different location Window Query ange Query in the B+-tree find all entries (Z-Values) in the range [l, u] where l = smallest Z-Value of the window (bottom left corner) u = largest Z-Value of the window (top right corner) l and u are computed with respect to the maximum resolution/length of the Z-values in the tree (here: 6) (0,2) (2,3) (10,6) (2,3) (6,4) esult: (0,2) (7,4) (7,4) (4,3) (20,5) (6,3) (21,5) (11,4) (6,3) (6,3) (7,3) Window: Min = (0,6), Max = (10,6) CMPUT 391 Database Management Systems University of Alberta 31 CMPUT 391 Database Management Systems University of Alberta 32

Spatial Data Management Shortcomings of elational Databases Modeling Spatial Data Spatial Queries Space-Filling Curves + B-Trees -trees The -Tree Properties Balanced Tree designed to organize rectangles [Gut 84]. Each page contains between m and M entries. Data page entries are of the form (MB, PointerToExactepr). MB is a minimum bounding rectangle of a spatial object, which PointerToExactepr is pointing to Directory page entries are of the form (MB, PointerToSubtree). MB is the minimum bounding rectangle of all entries in the subtree, which PointerToSubtree is pointing to. ectangles can overlap The height h of an -Tree for N spatial objects: h log m N + 1 Directory Level 1 Directory Level 2 Data Pages... Exact epresentations CMPUT 391 Database Management Systems University of Alberta 33 CMPUT 391 Database Management Systems University of Alberta 34 The -Tree Queries The -Tree Queries. A5 A6 A5 A6 S A4 A2 T S A4 A2 T Point Query A2 A6 Window Query A2 A6 T S A4 T S A4 Paths that the query has to follow A5 A5 Answer Set: [] Answer Set: [A2, ] PointQuery (Page, Point); FO ALL Entry Page DO IF Point IN Entry.MB THEN IF Page = DataPage THEN PointInPolygonTest (load(entry.exactepr), Point) ELSE PointQuery (Entry.Subtree, Point); First call: Page = oot of the -tree Window Query (Page, Window); FO ALL Entry Page DO IF Window INTESECTS Entry.MB THEN IF Page = DataPage THEN Intersection (load(entry.exactepr), Window) ELSE WindowQuery (Entry.Subtree, Window); CMPUT 391 Database Management Systems University of Alberta 35 CMPUT 391 Database Management Systems University of Alberta 36

-Tree Construction Optimization Goals Overlap between the MBs spatial queries have to follow several paths try to minimize overlap Empty space in MB spatial queries may have to follow irrelevant paths try to minimize area and empty space in MBs -Tree Construction Important Issues Split Strategy? Split into 2 pages A5 A4 S S A4 A5 How to divide a set of rectangles into 2 sets? M = 3, m = 2 A5 A4 Start: empty data page (= root) Insert: A5,,, A4 * A5,,, A4 (overflow) Insertion Strategy A5 A4 S A2 A4? Insert A2 S? A2 A5 Where to insert a new rectangle? CMPUT 391 Database Management Systems University of Alberta 37 CMPUT 391 Database Management Systems University of Alberta 38 -Tree Construction Insertion Strategies Dynamic construction by insertion of rectangles Searching for the data page into which will be inserted, traverses the tree from the root to a data page. When considering entries of a directory page P, 3 cases can occur: 1. falls into exactly one Entry.MB follow Entry.Subtree 2. falls into the MB of more than one entry e 1,..., e n follow E i.subtree for entry e i with the smallest area of e i.mb. 3. does not fall into an Entry.MB of the current page check the increase in area of the MB for each entry when enlarging the MB to enclose. Choose Entry with the minimum increase in area (if this entry is not unique, choose the one with the smallest area); enlarge Entry.MB and follow Entry.Subtree Construction by bulk-loading the rectangles Sort the rectangles, e.g., using Z-Order Create the -tree bottom-up -Tree Construction Split Insertion will eventually lead to an overflow of a data page The parent entry for that page is deleted. The page is split into 2 new pages - according to a split strategy 2 new entries pointing to the newly created pages are inserted into the parent page. A now possible overflow in the parent page is handled recursively in a similar way; if the root has to be split, a new root is created to contain the entries pointing to the newly created pages. A5 A2A6 A4 S S A4 A2 A5 A6 * Overlow split node M = 3, m = 2 U A5 A2A6 V A4 S A4 S U V A5 A2 A6 CMPUT 391 Database Management Systems University of Alberta 39 CMPUT 391 Database Management Systems University of Alberta 40

-Tree Construction Splitting Strategies Overflow of node K with K = M+1 entries Distribution of the entries into two new nodes K 1 and K 2 such that K 1 m and K 2 m Exhaustive algorithm: Searching for the best split in the set of all possible splits is too expensive (O(2 M ) possibilities!) Quadratic algorithm: Choose the pair of rectangles 1 and 2 that have the largest value d( 1, 2 ) for empty space in an MB, which covers both 1 und 2. d ( 1, 2 ) := Area(MB( 1 2 )) (Area( 1 ) + Area( 2 )) Set K 1 := { 1 } and K 2 := { 2 } epeat until STOP if all i are assigned: STOP if all remaining i are needed to fill the smaller node to guarantee minimal occupancy m: assign them to the smaller node and STOP else: choose the next i and assign it to the node that will have the smallest increase in area of the MB by the assignment. If not unique: choose the K i that covers the smaller area (if still not unique: the one with less entries). -Tree Construction Splitting Strategies Linear algorithm: Same as the quadratic algorithm, except for the choice of the initial pair: Choose the pair with the largest distance. For each dimension determine the rectangle with the largest minimal value and the rectangle with the smallest maximal value (the difference is the maximal distance/separation). Normalize the maximal distance of each dimension by dividing by the sum of the extensions of the rectangles in this dimension Choose the pair of rectangles that has the greatest normalized distance. Set K 1 := { 1 } and K 2 := { 2 }. Largest minimal value in dimension Smallest maximal value in dimension Smallest maximal value in dimension Largest minimal value in dimension max. distance for max. distance for CMPUT 391 Database Management Systems University of Alberta 41 CMPUT 391 Database Management Systems University of Alberta 42 -Trees Variants Many variants of -trees exist, e.g., the *-tree, -tree for higher dimensional point data, For further information see http://www.cs.umd.edu/~hjs/rtrees/index.html (includes an interactive demo) -trees are also efficient index structures for point data since points can be modeled as degenerated rectangles Multi-dimensional points, where a distance function between the points is defined play an important P 1 role for similarity search in so-called feature or multi-media databases. P 2 P 3 P 4 P 5 Examples of Feature Databases Measurements for celestial objects (e.g., intensity of emission in different wavelengths) Colour histograms of images Documents, shape descriptors, nd-dimensional feature vectors (o1 1, o1 2,, o1 d ) (o2 1, o2 2,, o2 d )... (on 1, on 2,, on d ) P 6 P 7 P 8 P 9 P 10 P 11 P 12 P 13 CMPUT 391 Database Management Systems University of Alberta 43 CMPUT 391 Database Management Systems University of Alberta 44

Feature Databases and Similarity Queries Objects + Metric Distance Function The distance function measures (dis)similarity between objects Basic types of similarity queries range queries with range ε etrieves all objects which are similar to the query object up to a certain degree ε k-nearest neighbor queries etrieves k most similar objects to the query query range ε query object 3-nn distance CMPUT 391 Database Management Systems University of Alberta 45