Multidimensional Data and Modelling (grid technique)

Grid file
• Database usage and integrated information systems are on the increase.
• File structures must give efficient access to records. How? Combine attribute values (multikeys).
• But traditional file structures that provide multikey access to records are extensions of file structures originally designed for single-key access.
• Thus, they show various deficiencies, in particular for multikey access to highly dynamic files.

Grid file
• Problem: spatial queries in k-d point sets
• Main idea: try to generalize hashing to k-d

Initial approach of locational data
[Figure: cities plotted as points in the square from (0,0) to (100,100), with axes x and y]
• Toronto (62,77)
• Buffalo (82,65)
• Chicago (35,42)
• Omaha (27,35)
• Memphis (45,30)
• Mobile (52,10)
• Atlanta (85,15)
• Miami (90,5)

Traditional single-key access

Grid file
• Special kind of hashing
• Adaptable: w.r.t. insert/delete
• Efficient query handling
• Dynamic: access time is uniform (two-disk-access principle)
• Symmetric: no secondary key; every key is treated as a primary key
• Multikey: records can be retrieved using a subset of the keys

Grid file
• A: put a grid
• Specs [Nievergelt+, 84] (Jurg Nievergelt):
  • symmetric to all attributes
  • 2 disk accesses for exact-match queries
  • adaptive to non-uniform distributions
• Q: details?

Grid file

Grid file
• Useful for range queries, which map onto a set of cells corresponding to a group of values along the linear scales.
• Can be applied to any number of search keys.
• n search keys => n dimensions.
• Reduces access time for multiple-key queries.

Grid file
How?
• Divide the record space into grid blocks

Grid file
• Allocates storage in units of fixed size
  • Disk blocks/pages/buckets
• How to map grid blocks to buckets?
  • Use a grid directory
• Two-disk-access: retrieve a single record in at most 2 disk accesses
  • Access the directory (grid)
  • Access the bucket (database)
• Efficient range queries

Grid Directory (k=2)

Single Record Access [1980,w]
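
The two-step lookup for a key such as [1980, w] can be sketched as follows. This is a minimal illustration, not the structure shown on the slides: the names `scales`, `directory`, `buckets`, `cell_of` and `lookup` are hypothetical, the partition values and records are invented, and Python dictionaries stand in for disk pages.

```python
from bisect import bisect_right

# Linear scales: partition boundaries per dimension (small, kept in memory).
# Invented example: the year axis is cut at 1500 and 1750, the letter axis at "k".
scales = [[1500, 1750], ["k"]]

# Grid directory: grid block -> bucket id (the first disk access).
# Several grid blocks may share one bucket.
directory = {
    (0, 0): "B1", (0, 1): "B1",
    (1, 0): "B2", (1, 1): "B3",
    (2, 0): "B4", (2, 1): "B4",
}

# Buckets: fixed-size data pages (the second disk access).
buckets = {
    "B1": [(1420, "c"), (1480, "p")],
    "B2": [(1600, "d")],
    "B3": [(1700, "t")],
    "B4": [(1980, "w"), (1800, "a")],
}

def cell_of(key):
    """Map a multikey to grid-block coordinates using the linear scales."""
    return tuple(bisect_right(s, v) for s, v in zip(scales, key))

def lookup(key):
    """Exact-match search: one directory access, then one bucket access."""
    bucket_id = directory[cell_of(key)]
    return key in buckets[bucket_id]

print(cell_of((1980, "w")), lookup((1980, "w")))
```

Because the scales are searched in memory, only the directory page and the bucket page have to be read, which is where the two-disk-access bound comes from.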

Range Query
• [1450-1600, c-g]
• Different buckets?
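
A range query such as [1450-1600, c-g] can touch several grid blocks, and those blocks may or may not point to different buckets. The sketch below, under the same invented scales and records as a toy directory (all names hypothetical), collects the distinct buckets first so that a shared bucket is read only once.

```python
from bisect import bisect_right
from itertools import product

scales = [[1500, 1750], ["k"]]           # invented partition boundaries
directory = {                            # grid block -> bucket id
    (0, 0): "B1", (0, 1): "B1",
    (1, 0): "B2", (1, 1): "B3",
    (2, 0): "B4", (2, 1): "B4",
}
buckets = {                              # bucket id -> records
    "B1": [(1420, "c"), (1480, "p")],
    "B2": [(1600, "d")],
    "B3": [(1700, "t")],
    "B4": [(1980, "w"), (1800, "a")],
}

def range_query(lows, highs):
    """Return (buckets read, matching records) for an inclusive range."""
    # Interval of grid-block indices touched along each dimension.
    spans = [range(bisect_right(s, lo), bisect_right(s, hi) + 1)
             for s, lo, hi in zip(scales, lows, highs)]
    # Distinct buckets behind all touched grid blocks.
    bucket_ids = {directory[cell] for cell in product(*spans)}
    # Filter the records of those buckets against the query range.
    hits = [rec for b in bucket_ids for rec in buckets[b]
            if all(lo <= v <= hi for v, lo, hi in zip(rec, lows, highs))]
    return sorted(bucket_ids), sorted(hits)

print(range_query((1450, "c"), (1600, "g")))
```

In this toy setup the query touches two grid blocks that live in two different buckets, so two data pages are read.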

Next in each direction
• NextXabove: cx = (cx + 1) mod nx
• NextXbelow: cx = (cx - 1) mod nx
• NextYabove: cy = (cy + 1) mod ny
• NextYbelow: cy = (cy - 1) mod ny
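
The four operations are the same wrap-around step applied to one axis at a time. A minimal sketch, with hypothetical function names, where `c` is the current grid-block index on an axis and `n` the number of intervals on that axis:

```python
def next_above(c, n):
    """Next grid-block index upward along one axis, wrapping around."""
    return (c + 1) % n

def next_below(c, n):
    """Next grid-block index downward along one axis, wrapping around."""
    return (c - 1) % n

# Walk the x axis of a directory with nx = 3 intervals.
nx = 3
print([next_above(cx, nx) for cx in range(nx)])
print([next_below(cx, nx) for cx in range(nx)])
```

Python's `%` returns a non-negative result for a positive modulus, so `next_below(0, n)` wraps to `n - 1`. In languages where the remainder takes the sign of the dividend (C, Java), the same step is usually written `(c - 1 + n) % n`.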

Insertion
• Bucket size = 4
• When a bucket overflows: split it!

Grid File Insertion

Grid File Insertion
• In this example the dimensions are split according to a fixed schedule
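
Insertion with bucket splitting can be sketched as below. This is a simplified illustration under stated assumptions, not the algorithm from the slides' figures: the domain is assumed to be [0, 100] on both axes, new boundaries are placed at interval midpoints, the schedule simply alternates x and y, points are assumed distinct, and all names are hypothetical.

```python
from bisect import bisect_right
from itertools import count

CAPACITY = 4                 # bucket size from the slide
LO, HI = 0.0, 100.0          # assumed domain of both axes

scales = [[], []]            # interior boundaries per dimension
directory = {(0, 0): 0}      # grid block -> bucket id
buckets = {0: []}            # bucket id -> points stored in it
bucket_ids = count(1)
schedule = count(0)          # fixed schedule: split x, then y, then x, ...

def cell_of(p):
    return tuple(bisect_right(s, v) for s, v in zip(scales, p))

def region_of(bid):
    return sorted(c for c, b in directory.items() if b == bid)

def refine(cell):
    """Add a boundary through `cell` on the next scheduled dimension."""
    d = next(schedule) % 2
    i = cell[d]
    edges = [LO] + scales[d] + [HI]
    scales[d].insert(i, (edges[i] + edges[i + 1]) / 2)   # midpoint cut
    old = dict(directory)
    directory.clear()
    for c, b in old.items():
        c = list(c)
        in_cut_slice = c[d] == i
        if c[d] > i:
            c[d] += 1                   # blocks beyond the cut shift by one
        directory[tuple(c)] = b
        if in_cut_slice:                # both halves keep the old bucket
            c[d] += 1
            directory[tuple(c)] = b

def split(bid):
    """Divide an overflowing bucket between two halves of its region."""
    region = region_of(bid)
    if len(region) == 1:                # a single block: the directory grows
        refine(region[0])
        region = region_of(bid)
    d = 0 if region[0][0] != region[-1][0] else 1
    mid = (region[0][d] + region[-1][d]) // 2
    new = next(bucket_ids)
    for c in region:
        if c[d] > mid:
            directory[c] = new          # upper half gets the new bucket
    points = buckets[bid]
    buckets[bid] = [p for p in points if directory[cell_of(p)] == bid]
    buckets[new] = [p for p in points if directory[cell_of(p)] == new]
    for b in (bid, new):                # a half may still overflow
        if len(buckets[b]) > CAPACITY:
            split(b)

def insert(p):
    bid = directory[cell_of(p)]
    buckets[bid].append(p)
    if len(buckets[bid]) > CAPACITY:
        split(bid)

for city in [(62, 77), (82, 65), (35, 42), (27, 35),
             (45, 30), (52, 10), (85, 15), (90, 5)]:
    insert(city)
print(scales, directory)
```

Note that a refinement cuts a whole slice of the directory, but only the overflowing bucket is actually split; the other blocks in the slice keep sharing their old bucket.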

Directory Merging
• No queries between [a-k] and [0-1500]

Directory Merging
• The grid directory is trimmed on merging

Concurrent Access
• There is no root node as in trees (where the root is a bottleneck), which allows concurrent access

Advantages
• No special computations are required
• Only the right records are retrieved
• Can also be used for single-search-key queries
• Easy to extend to queries on n search keys
• Significant improvement in processing time for multiple-key queries
• Has a two-disk-access upper bound for accessing data
• Allows simpler concurrency control protocols

Grid files - disadvantages
• #1: problems in high dimensions: directory splits can be expensive
• #2: even in low dimensions, suffers on correlated attributes
  • (A1: rotate the axes; A2: use triangular cells)

Grid files - disadvantages
• #3: how about region data?
• If we cut the regions along cell boundaries, we get O(volume) pieces (while z-ordering gives O(surface))
• What to do? Translate each k-d region to a 2k-d point! (Clever, BUT still has subtle problems)
• E.g., 1-d regions A, B, C become points in the (x-start, x-end) plane
[Figure: 1-d regions A, B and C on an axis marked 0, ¼, ½, ¾, 1, and the same regions plotted as points with axes x-start and x-end]
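
The translation can be sketched for the 1-d case as follows. The endpoints of A, B and C are invented for illustration (the figure's exact values are not recoverable from the text), and the names are hypothetical. An intersection query on regions becomes a range query on the transformed points.

```python
# Hypothetical 1-d regions as (x_start, x_end); the endpoints are invented.
regions = {"A": (0.0, 0.25), "B": (0.25, 0.75), "C": (0.5, 1.0)}

def as_point(region):
    """A 1-d region [start, end] becomes the 2-d point (start, end)."""
    start, end = region
    return (start, end)

points = {name: as_point(r) for name, r in regions.items()}

def intersecting(q_start, q_end):
    """Regions overlapping [q_start, q_end], found in point space.

    Overlap holds exactly when x_start <= q_end and x_end >= q_start,
    which is a rectangular range in the (x_start, x_end) plane.
    """
    return sorted(name for name, (s, e) in points.items()
                  if s <= q_end and e >= q_start)

# Every transformed point has x_end >= x_start.
above_diagonal = all(e >= s for s, e in points.values())

print(intersecting(0.3, 0.4), intersecting(0.6, 0.9), above_diagonal)
```

The slides do not say which subtle problems they mean. One that this sketch makes visible is that all transformed points lie on or above the diagonal x-end = x-start, so the point set is far from uniform, which is the kind of correlated data that grid files handle poorly.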

Disadvantages
• Dimensionality curse; large query regions
• Imposes space overhead
• Performance overhead on insertion and deletion
• Frequent reorganization of the file adds to the maintenance cost

BANG file

Two-level grid file

Twin grid file
A given set of points can be distributed among two grid files in such a way that storage space utilization is optimal. The optimal twin grid file can be built practically as fast as a standard grid file, i.e. the storage space optimality is obtained at almost no extra cost.

Twin grid file
The comparison covers the performance of the standard grid file, the optimal static twin grid file, and an efficient dynamic twin grid file, in which insertions and deletions trigger the redistribution of points among the two grid files. Twin grid files utilize storage space at roughly 90%, as compared with the 69% of the standard grid file. Typical range queries, the most important spatial search operations, can be answered in twin grid files at least as fast as in the standard grid file.

Buddy tree
The buddy tree is a dynamic hashing scheme with a tree-like directory. The universe is cut recursively into two parts of equal size with iso-oriented hyperplanes, and each interior node corresponds to a partition together with an interval. The interval is the minimum bounding box (MBB) of the points stored below the given node. Also:
• Each directory node contains at least two entries;
• Whenever a node is split, the MBBs of the subnodes are recomputed to fit the new situation;
• Except for the root of the directory, there is exactly one pointer referring to each directory page.
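
The two ingredients named above, halving the universe with an iso-oriented hyperplane and recomputing MBBs after a split, can be sketched as follows. This is only that one step, not a full buddy tree; the function names are hypothetical and the points are the city coordinates from the locational data slide.

```python
def mbb(points):
    """Minimum bounding box of a non-empty point set: (lows, highs)."""
    lows = tuple(min(coords) for coords in zip(*points))
    highs = tuple(max(coords) for coords in zip(*points))
    return lows, highs

def halve(box, dim):
    """Cut a box into two equal halves with a hyperplane orthogonal to `dim`."""
    lows, highs = box
    mid = (lows[dim] + highs[dim]) / 2
    left = (lows, highs[:dim] + (mid,) + highs[dim + 1:])
    right = (lows[:dim] + (mid,) + lows[dim + 1:], highs)
    return left, right

def split_node(points, universe, dim):
    """Split points by halving the universe; return (MBB, points) per part."""
    left_box, _ = halve(universe, dim)
    mid = left_box[1][dim]
    left = [p for p in points if p[dim] < mid]
    right = [p for p in points if p[dim] >= mid]
    # Empty halves get no entry; each MBB is fitted to the points it covers.
    return [(mbb(part), part) for part in (left, right) if part]

cities = [(62, 77), (82, 65), (35, 42), (27, 35),
          (45, 30), (52, 10), (85, 15), (90, 5)]
for box, part in split_node(cities, ((0, 0), (100, 100)), 0):
    print(box, part)
```

Because each entry stores the MBB rather than the full half of the universe, empty space is not represented in the directory, and a query that misses every MBB can stop without reading data pages.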

Buddy tree