Indexing Methods. Lecture 9. Storage Requirements of Databases

- Data needs to be stored permanently, or persistently for long periods of time
- Databases are usually too big to fit in main memory
- Low cost of storage per unit of data, and the definition of very large databases
- The main cost incurred after storage is that of searching the database
- Primary and secondary (auxiliary) file organizations

File Organizations
- Relations are usually stored in files as logical records and read in terms of physical blocks
- File organization refers to the way records are stored in blocks, and the way blocks are placed on the storage medium and interlinked
- Types of organizations: unsorted, sorted, hashing

Records
- A record represents a tuple in a relation
- A file is a sequence of records
- Records can be either fixed-length or variable-length
- A record comprises a sequence of fields (columns, attributes)

Blocks
- Blocks are the physical units of storage in storage devices (examples: sectors in hard disks, pages in virtual memory)
- Blocks are of fixed length, based on the physical characteristics of the storage/computing device and the operating system
- A storage device is either defragmented or fragmented, depending on whether contiguous sets of records lie in contiguous blocks

Blocking Factor
- The number of records stored in a block is called the blocking factor
- The blocking factor is constant across blocks if the record length is fixed, and variable otherwise
- If $B$ is the block size and $R$ is the record size, the blocking factor is $bfr = \lfloor B / R \rfloor$
- Since $R$ may not exactly divide $B$, there can be left-over space in each block equal to $B - (bfr \cdot R)$ bytes
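The blocking factor formula can be checked with a few lines of Python. This is a minimal sketch for fixed-length records; the function names and the 512-byte block / 100-byte record figures are assumptions chosen for illustration, not values from the lecture.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Number of whole fixed-length records that fit in one block: floor(B / R)."""
    if block_size <= 0 or record_size <= 0:
        raise ValueError("sizes must be positive")
    return block_size // record_size


def unused_bytes(block_size: int, record_size: int) -> int:
    """Left-over space per block: B - (bfr * R)."""
    return block_size - blocking_factor(block_size, record_size) * record_size


# Assumed example values: B = 512 bytes, R = 100 bytes.
print(blocking_factor(512, 100))  # 5
print(unused_bytes(512, 100))     # 12
```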

Spanned and Unspanned Records
- When the extra space in a block is left unused, the record organization is said to be unspanned
- [Figure: a block holding three whole records followed by unused space]
- In spanned record storage, records can be split so that they span across blocks
- [Figure: block m holds three whole records and the first part of Record 4, plus a pointer p to block p, which holds the remaining part of Record 4]
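The space trade-off between the two organizations can be sketched numerically. The helper names and sample sizes below are assumptions for illustration. The spanned estimate ignores the per-block pointer that links the pieces of a split record, so it is slightly optimistic.

```python
def blocks_unspanned(num_records: int, block_size: int, record_size: int) -> int:
    """Blocks needed when records are never split across blocks."""
    bfr = block_size // record_size
    if bfr == 0:
        # R > B: no whole record fits in a block, so spanning is the only option.
        raise ValueError("record larger than block: spanned storage is required")
    return -(-num_records // bfr)  # ceiling division


def blocks_spanned(num_records: int, block_size: int, record_size: int) -> int:
    """Blocks needed when records may be split (pointer overhead ignored)."""
    total_bytes = num_records * record_size
    return -(-total_bytes // block_size)  # ceiling division


# Assumed example: 1000 records of 100 bytes in 512-byte blocks.
print(blocks_unspanned(1000, 512, 100))  # 200
print(blocks_spanned(1000, 512, 100))    # 196
```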

Spanned and Unspanned Records
- When the record size is greater than the block size (i.e. $R > B$), the use of spanned record storage is compulsory

Indexes
- Index files: secondary or auxiliary files that help speed up data access in primary files
- Indexes or access structures: data structures (and search methods) used for fast access
- Single-level index: the index file maps directly to the block or the address of the record
- Multi-level index: multiple levels of indirection among indexes

Definitions
- Indexing field (indexing attribute): the field on which an index structure is built (searching is fast on this field)
- Primary index: an index structure defined on the ordering key field (the field used to physically order records on disk in sorted file organizations)
- Clustering index: when the ordering field is not a key field (i.e. not unique), a clustering index is used instead of a primary index
- Secondary index: an index structure defined on a non-ordering field

Primary Indexes
- A primary index consists of an ordered file of fixed-length records with two fields
- The first field is of the same data type as the ordering key (primary key), and the second field is a block address
- Primary index records are represented by a pair (k(i), a(i)), where k(i) is the key of the i-th index record and a(i) is the address of the block containing the data record with that key
- [Figure: primary index. An index file of (K(i), a(i)) entries points to the blocks of a data file with fields RollNo, Name, Age, Gender, Grade, ordered on RollNo]

Primary Index
- The number of entries in the index is equal to the number of disk blocks in the ordered data file
- The first record in each block of the file is indexed (in sparse indexes); these records are called anchor records
- A sparse index has index entries for only some of the search values
- A dense index has an index entry for every search key value (every record in the data file)
- Dense indexes are not beneficial on ordered data files
- Search: easy. Perform a binary search on the index file to identify the block containing the required record
- Insertion / deletion: easy if key values in records are fixed-length and statically allocated to blocks without block spanning (this results in wasted space, however)
- Otherwise, the index must be recomputed on insertion / deletion, and overflow buffers may be necessary
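The search step on a sparse primary index can be sketched as follows. The in-memory lists stand in for disk blocks, and the keys and record payloads are made-up sample data. `bisect_right` from the standard library performs the binary search over the anchor keys.

```python
from bisect import bisect_right

# Hypothetical ordered data file: each block is a sorted list of (key, record) pairs.
data_blocks = [
    [(1, "a"), (3, "b"), (5, "c")],
    [(8, "d"), (10, "e"), (12, "f")],
    [(15, "g"), (18, "h")],
]

# Sparse primary index: one (anchor key, block address) entry per data block.
index = [(block[0][0], addr) for addr, block in enumerate(data_blocks)]
anchor_keys = [k for k, _ in index]


def lookup(key):
    """Binary-search the index for the last anchor key <= key, then scan that block."""
    pos = bisect_right(anchor_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every anchor key
    _, addr = index[pos]
    for k, record in data_blocks[addr]:
        if k == key:
            return record
    return None


print(lookup(10))  # e
print(lookup(9))   # None
```

Only the index is binary-searched. The data file is touched once, for the single block that can contain the key.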

Clustering Index
- Clustering field: a non-key ordering field. That is, blocks are ordered on this field, which does not have the UNIQUE constraint
- The structure of the index file is similar to that of a primary index file, but each index entry points to the first block having the given value in its clustering field
- There is one index entry for every distinct value of the clustering field
- [Figure: clustering index. Index entries (K(I), A(I)), one per distinct Dept No value, point to the first block holding that value in a data file with fields Dept No, Name, Gender, DOB, Job]

Clustering Index
- A clustering index is a sparse index, since only distinct values are indexed
- Insertion and deletion cause problems when a block can hold more than one value of the clustering field
- Alternative solution: allocate separate blocks for each value of the clustering field
- [Figure: clustering index with a separate block (or chain of blocks) for each Dept No value in a data file with fields Dept No, Name, Gender, DOB, Job]
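A clustering index can be built in one pass over the ordered file. The sketch below assumes that each block is reduced to the list of clustering-field values it holds. The sample values and function names are hypothetical.

```python
from bisect import bisect_left


def build_clustering_index(blocks):
    """One (value, first block address) entry per distinct clustering-field value.

    Assumes the file is ordered on the clustering field, so equal values are adjacent.
    """
    index = []
    for addr, block in enumerate(blocks):
        for value in block:
            if not index or index[-1][0] != value:
                index.append((value, addr))
    return index


def blocks_with_value(index, blocks, value):
    """Start at the indexed block and move forward while the value still occurs."""
    keys = [v for v, _ in index]
    pos = bisect_left(keys, value)
    if pos == len(keys) or keys[pos] != value:
        return []
    result = []
    for addr in range(index[pos][1], len(blocks)):
        if value not in blocks[addr]:
            break
        result.append(addr)
    return result


# Hypothetical file: three blocks, ordered on a non-unique field.
blocks = [[1, 1, 2], [2, 3, 3], [3, 3, 5]]
index = build_clustering_index(blocks)
print(index)                                # [(1, 0), (2, 0), (3, 1), (5, 2)]
print(blocks_with_value(index, blocks, 3))  # [1, 2]
```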

Secondary Index
- Used to index fields that are not ordering fields (they may be key or non-key fields)
- Many secondary indexes are possible on a single file
- There is one index entry for every record in the data file (dense index), containing the value of the indexed attribute and a pointer to the block / record

Secondary Index on Key Field
- [Figure: secondary index on a key field. Index entries (K(i), A(i)), sorted on RollNo, point to the records of a data file with fields RollNo, Name, Age, Dept No, Job that is not ordered on RollNo]
- The index has as many entries as the number of records

Secondary Index on Key Field
- Since key fields are unique, the number of index entries is equal to the number of records
- The data file need not be sorted on disk
- The index file uses fixed-length records

Secondary Index on Non-key Field
- When a non-key field is indexed, duplicate values have to be handled
- There are three different techniques for handling duplicates:
  - Duplicate index entries
  - Variable-length records
  - Extra redirection levels

Duplicate Index Entries
- [Figure: index file in which the same K(i) value appears in several consecutive entries, each with its own A(i)]
- Index entries are repeated for each duplicate occurrence of the non-key attribute
- Binary search becomes more complicated: the mid-point of a search may have duplicate entries on either side
- Insertion of records may need restructuring of the index table

Variable Length Records
- Use variable-length records for the index table in order to accommodate duplicate key entries
- For a given key K(i), there is a set of address pointers instead of a single address pointer
- Binary search becomes complicated, since the addresses of mid-points cannot be computed efficiently
- Insertion of records may need restructuring of the index table

Extra Redirection Levels
- [Figure: secondary index on the non-key field LabId. Each index entry (K(I), A(I)) points to a block of addresses, and these addresses point to the records of a data file with fields RollNo, Name, Age, LabId, Grade]
- This is the most frequently used technique
- Index records are of fixed length
- A(i) in an index record points to a block of address fields
- Block overflows are handled by chaining
- Retrieval requires a sequential search within the address blocks
- Insertion of records is straightforward
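The extra level of indirection can be sketched with an index of fixed-size entries, each pointing to a list that plays the role of an address block. The data file contents, the field layout (RollNo, LabId) and the function names are assumptions for illustration. Overflow chaining is not modelled.

```python
from bisect import bisect_left

# Hypothetical data file, not sorted on LabId: record address -> (RollNo, LabId).
data_file = {0: (101, 3), 1: (102, 1), 2: (103, 3), 3: (104, 2), 4: (105, 1)}


def build_index(data_file):
    """Return (keys, address_blocks): one fixed-size entry per distinct LabId."""
    by_value = {}
    for addr, (_, lab_id) in data_file.items():
        by_value.setdefault(lab_id, []).append(addr)
    keys = sorted(by_value)
    address_blocks = [by_value[k] for k in keys]
    return keys, address_blocks


def find(data_file, keys, address_blocks, lab_id):
    """Binary search on the index, then a sequential pass over one address block."""
    pos = bisect_left(keys, lab_id)
    if pos == len(keys) or keys[pos] != lab_id:
        return []
    return [data_file[addr] for addr in address_blocks[pos]]


keys, address_blocks = build_index(data_file)
print(find(data_file, keys, address_blocks, 3))  # [(101, 3), (103, 3)]
```

Inserting a record with an existing LabId only appends one address to the matching address block. The index entries themselves do not move, which is why insertion is simple under this technique.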

Multi-level Indexes
- Binary search on a single-level index requires a search time of the order of $\log_2 b$ block accesses, where $b$ is the number of blocks in the index file
- If the bfr of the index file is greater than 2, the number of block accesses can be reduced even further
- Multi-level indexes are meant for such a reduction
- A multi-level index contains several levels of the index file
- Each index block at a given level connects to a maximum of $fo$ blocks at the next level, where $fo$ is called the fan-out of the index structure
- Block accesses are reduced from $\log_2 b$ to $\log_{fo} b$ on average
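The reduction can be made concrete with a small calculation. The sketch below counts one block access per index level, assuming every index block is full and the top level fits in a single block. The figures b = 1000 and fo = 50 are assumed sample values.

```python
def binary_search_accesses(b: int) -> int:
    """Approximate block accesses for binary search over b index blocks: ceil(log2 b)."""
    return (b - 1).bit_length()


def index_levels(first_level_blocks: int, fo: int) -> int:
    """Number of index levels (first level included) until one block remains.

    A search reads one block per level, so this is also the number of
    index block accesses, which grows like log base fo of b.
    """
    if fo < 2:
        raise ValueError("fan-out must be at least 2")
    levels = 1
    blocks = first_level_blocks
    while blocks > 1:
        blocks = -(-blocks // fo)  # ceiling division: blocks needed one level up
        levels += 1
    return levels


# Assumed example: first-level index of 1000 blocks, fan-out 50.
print(binary_search_accesses(1000))  # 10
print(index_levels(1000, 50))        # 3
```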

A Two-level Index Structure
- [Figure: two-level index. Entries of the second (top) level point to blocks of the first (base) level, whose entries point to blocks of the data file]
- The first (base) level is the usual primary index, maintained in a sorted file
- The second (top) level is a primary index into the first-level index file
- The process can be repeated to any number of levels
- Each level has fewer entries than the level below it by a factor of $fo$
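A two-level lookup can be sketched by treating the first-level index as a sorted file and building a sparse index on top of it. The keys, the fan-out of 3, and the `data-block-N` labels are assumptions for illustration.

```python
from bisect import bisect_right

FO = 3  # assumed fan-out: index entries per index block


def chunk(entries, size):
    """Split a sorted entry list into blocks of at most `size` entries."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def build_upper_level(lower_blocks):
    """One (first key, block number) entry per block of the level below."""
    return [(block[0][0], n) for n, block in enumerate(lower_blocks)]


def search_block(block, key):
    """Pointer of the last entry whose key is <= the search key, or None."""
    keys = [k for k, _ in block]
    pos = bisect_right(keys, key) - 1
    return block[pos][1] if pos >= 0 else None


# First (base) level: one entry per data block, keyed by that block's anchor key.
anchors = [2, 5, 8, 12, 15, 21, 24, 29, 35]
first_level = chunk([(k, f"data-block-{i}") for i, k in enumerate(anchors)], FO)
# Second (top) level: a primary index into the first-level index file.
second_level = build_upper_level(first_level)


def lookup(key):
    """Read one top-level block, then one base-level block."""
    n = search_block(second_level, key)
    if n is None:
        return None
    return search_block(first_level[n], key)


print(lookup(16))  # data-block-4
```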

Summary: Types of Indexes

                      Key field                Non-key field
Ordering field        Primary index            Clustering index
Non-ordering field    Secondary index (key)    Secondary index (non-key)

Summary: Properties of Indexes

Index type             Number of (first-level) index entries                   Dense or non-dense
Primary                Number of blocks in data file                           Non-dense
Clustering             Number of distinct index field values                   Non-dense
Secondary (key)        Number of records in data file                          Dense
Secondary (non-key)    Number of records or number of distinct field values    Dense or non-dense

Summary
- Multi-level indexes: several levels of index files
- Characteristic fan-out property
- Fan-out $fo$ preferably greater than 2
- Reduces the number of block accesses to the order of $\log_{fo} b$

Dynamic Multi-level Indexes

Balanced and Unbalanced Index Trees
- Unbalanced: $O(n)$
- Balanced: $\Theta(\log_{fo} n)$

Insertions and Deletions
- The balanced property of index trees should be maintained during insertions and deletions
- Insertions and deletions are problematic in a multi-level index, since all index files are physically sorted files
- An approach to overcome this is to use dynamic multi-level indexes

B-Trees
- A tree data structure in which each node has a predetermined maximum fan-out p
- Terminology: root node, leaf nodes, internal nodes, parent, children

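Structure of a Node
- [Figure: a B-tree node with keys $K_1, K_2, \ldots, K_{i-1}, K_i, \ldots$, each with a data pointer, and tree pointers between the keys. The left-most subtree holds values $X < K_1$, the subtree between two adjacent keys holds $K_{i-1} < X < K_i$, and the right-most subtree holds values greater than the last key]

B-Tree Constraints
- For a node containing $p - 1$ keys (or $p$ subtrees), the following condition must always hold: $K_1 < K_2 < \cdots < K_{p-1}$
- For any data element $X$ in subtree $P_i$, it should always be the case that $K_{i-1} < X < K_i$, with $X < K_1$ in the left-most subtree and $X > K_{p-1}$ in the right-most subtree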

B-Tree Constraints
- Each node has at most $p$ tree pointers
- Each node, except the root and leaf nodes, has at least $\lceil p/2 \rceil$ tree pointers (tree balancing constraint)
- The root node has at least 2 tree pointers, unless it is the only node in the tree
- All leaf nodes are at the same level
- In a leaf node, all tree pointers are null

B+ Trees
- The most common index structure in RDBMSs
- Leaf and non-leaf nodes have different structures: data pointers are stored only at the leaf nodes
- Leaf nodes form a dense index containing an entry for every value of the search field and its corresponding record pointer
- Leaf nodes are linked to provide ordered access to the data file records

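Non-leaf Nodes in B+ Trees
- [Figure: a non-leaf node with keys $K_1, \ldots, K_{i-1}, K_i, \ldots$ and tree pointers only. The left-most subtree holds the smallest values, each subtree between two adjacent keys holds the values between them, and the right-most subtree holds the largest values]

Leaf Nodes in B+ Trees
- [Figure: a leaf node with keys $K_1, \ldots, K_{i-1}, K_i, \ldots$, a data pointer for each key, and a pointer to the next leaf node in the tree]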

Properties of Leaf Nodes
- Keys along the chain of leaf nodes are organized in sorted order: $K_1 < K_2 < \cdots < K_n$
- Each leaf node has at least $\lceil p/2 \rceil$ values
- All leaf nodes are at the same level

Searching in B+ Trees
- A generalization of binary search
  1. Given a search key $k$, start from the root node
  2. If the key is present in the current node, then success; else
  3. If the current node is a leaf node and the key is not present in it, then the key is not in the database
  4. Search for a tree pointer $P_i$ such that $K_{i-1} < k \le K_i$
  5. Follow $P_i$ and return to step 2 to continue the search
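The search steps can be sketched in Python. Because data pointers in a B+ tree live only in the leaves, this sketch always descends to a leaf before reporting success. The `Node` class, the hand-built tree and the `rec-N` data pointers are assumptions for illustration. The pointer choice follows the condition $K_{i-1} < k \le K_i$, which `bisect_left` computes directly.

```python
from bisect import bisect_left


class Node:
    """Hypothetical B+ tree node: internal nodes have children, leaves have pointers."""

    def __init__(self, keys, children=None, pointers=None):
        self.keys = keys
        self.children = children  # list of child nodes (internal nodes only)
        self.pointers = pointers  # list of data pointers (leaf nodes only)


def search(root, k):
    node = root
    while node.children is not None:
        # Number of keys < k gives i with K_{i-1} < k <= K_i.
        node = node.children[bisect_left(node.keys, k)]
    i = bisect_left(node.keys, k)
    if i < len(node.keys) and node.keys[i] == k:
        return node.pointers[i]
    return None  # reached a leaf without finding k


def leaf(*keys):
    return Node(list(keys), pointers=[f"rec-{key}" for key in keys])


# Hand-built sample tree with three levels.
root = Node([5], children=[
    Node([3], children=[leaf(1, 3), leaf(5)]),
    Node([7, 8], children=[leaf(6, 7), leaf(8), leaf(9, 12)]),
])

print(search(root, 7))  # rec-7
print(search(root, 4))  # None
```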

Insertion
- Originally, the tree begins with only the root node
- As and when nodes fill up, they are split and made children of a new node
- Keys are split uniformly across the three nodes
- [Example: keys are inserted one at a time, beginning with 5 and 8]
- [Figure: the tree after insertion of 5 and 8, a single node holding both keys]
- Insertion of the next key, 3, causes an overflow requiring a split

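Insertion
- [Figure: root node with key 5; leaf nodes (3, 5) and (8)]
- 7 is inserted into the leaf node holding 8: no overflow
- [Figure: root node with key 5; leaf nodes (3, 5) and (7, 8)]
- Insertion of the next key causes overflows that need to be cascaded to upper levels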

Insertion
- [Figure: the tree after the split]
- Insertion of 9
- [Figure: the tree after insertion of 9]
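The split-and-cascade behaviour can be sketched as a small B+ tree insertion routine. Several choices here are assumptions, not taken from the slides: a fan-out of p = 3, leaves holding at most p - 1 keys, unique keys, the key sequence 8, 5, 1, 7, 3, 12, 9, 6, and all class and function names. When a leaf splits, the largest key of its left half is copied into the parent, which matches the search condition $K_{i-1} < k \le K_i$.

```python
from bisect import bisect_left, insort


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # used by internal nodes only
        self.next = None    # leaf chain for ordered access


class BPlusTree:
    def __init__(self, p=3):
        self.p = p  # assumed: max tree pointers per node; leaves hold p - 1 keys
        self.root = Node(leaf=True)

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:  # the root itself was split: grow a new root
            sep, right = split
            old_root = self.root
            self.root = Node(leaf=False)
            self.root.keys = [sep]
            self.root.children = [old_root, right]

    def _insert(self, node, key):
        """Insert below `node`; return (separator, new right sibling) on a split."""
        if node.leaf:
            insort(node.keys, key)
            if len(node.keys) < self.p:
                return None
            mid = (len(node.keys) + 1) // 2
            right = Node(leaf=True)
            right.keys = node.keys[mid:]
            node.keys = node.keys[:mid]
            right.next, node.next = node.next, right
            return node.keys[-1], right  # separator is copied up
        i = bisect_left(node.keys, key)
        split = self._insert(node.children[i], key)
        if split is None:
            return None
        sep, right = split
        node.keys.insert(i, sep)
        node.children.insert(i + 1, right)
        if len(node.children) <= self.p:
            return None
        mid = len(node.keys) // 2  # overflow cascades: split this internal node too
        sep_up = node.keys[mid]
        new = Node(leaf=False)
        new.keys = node.keys[mid + 1:]
        new.children = node.children[mid + 1:]
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return sep_up, new  # separator is moved up


def leaf_keys(tree):
    """Walk the leaf chain from the left-most leaf."""
    node = tree.root
    while not node.leaf:
        node = node.children[0]
    out = []
    while node is not None:
        out.append(list(node.keys))
        node = node.next
    return out


tree = BPlusTree(p=3)
for key in [8, 5, 1, 7, 3, 12, 9, 6]:  # assumed sample sequence
    tree.insert(key)
print(tree.root.keys)   # [5]
print(leaf_keys(tree))  # [[1, 3], [5], [6, 7], [8], [9, 12]]
```

Deletion with merging on underflow is not covered by this sketch.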

Deletion
- Deletion of keys may cause underflows, which have to be handled separately
- An underflow occurs when a node contains fewer than $p/2$ keys
- Nodes are merged with their siblings when underflows occur

Indexes on Multiple Attributes
- All index structures explored so far assume simple attributes, comprising only one value
- Many applications require multi-attribute (composite) keys

Ordered Index on Multiple Attributes
- Considers a composite key as a tuple of simple keys $(k_1, k_2, \ldots, k_n)$
- Ordered index files are maintained by ordering on each key in sequence

Partitioned Hashing
- Given a composite key $(k_1, k_2, \ldots, k_n)$, partitioned hashing returns $n$ different bucket numbers
- The hash bucket is determined by concatenating the $n$ numbers
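Partitioned hashing can be sketched by hashing each component of the composite key to a fixed number of bits and concatenating the bit strings. The use of `zlib.crc32` as the per-attribute hash, the bit widths and the sample keys are assumptions for illustration. Any deterministic hash function would do.

```python
import zlib


def partitioned_hash(composite_key, bits_per_attr):
    """Hash each component separately, then concatenate the resulting bit strings."""
    parts = []
    for value, bits in zip(composite_key, bits_per_attr):
        h = zlib.crc32(repr(value).encode("utf-8")) % (1 << bits)
        parts.append(format(h, f"0{bits}b"))  # zero-padded binary bucket number
    return "".join(parts)


# Assumed example: (RollNo, Grade) with 3 bits for RollNo and 2 bits for Grade.
a = partitioned_hash((4, "B"), (3, 2))
b = partitioned_hash((4, "C"), (3, 2))
print(len(a))          # 5
print(a[:3] == b[:3])  # True
```

Because each attribute owns its own slice of the bucket address, a query that fixes only one attribute still narrows the search to the buckets sharing that slice.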

Grid Files
- Partitions the range of key values for each key into several buckets
- Combinations of the buckets of each key form a grid
- A grid file stores the grid in either row-major or column-major form
- [Figure: grid file with Grade (A, B, C, D) along one axis and five Roll No ranges along the other. Each grid cell points into a bucket pool, and a linear scale maps Roll No ranges to the five range numbers]
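A grid file lookup can be sketched with one linear scale per attribute and a row-major cell number. The range boundaries for Roll No, the one-bucket-per-cell layout and the function names are assumptions for illustration. In practice several cells may share a bucket.

```python
from bisect import bisect_right

# Assumed linear scales: lower bounds of Roll No ranges 2..5, and the grade values.
roll_bounds = [26, 51, 76, 101]  # ranges: below 26, 26-50, 51-75, 76-100, 101 and up
grades = ["A", "B", "C", "D"]


def cell(roll_no, grade):
    """Map a composite key to its (row, column) grid cell."""
    row = bisect_right(roll_bounds, roll_no)  # 0..4
    col = grades.index(grade)                 # 0..3
    return row, col


def bucket_number(roll_no, grade):
    """Row-major position of the cell in the stored grid."""
    row, col = cell(roll_no, grade)
    return row * len(grades) + col


print(cell(30, "B"))           # (1, 1)
print(bucket_number(30, "B"))  # 5
```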

Summary
- Multi-level indexes
- Trees: root node, leaf nodes, non-leaf (internal) nodes
- Dynamic multi-level indexes: B-trees and B+ trees
- Insertion and deletion in B+ trees
- Indexes on multiple attributes