Indexing Methods. Lecture 9. Storage Requirements of Databases

Storage Requirements of Databases
Data needs to be stored permanently, or persistently for long periods of time.
Databases are usually too big to fit in main memory.
Low cost of storage per unit of data, and the definition of very large databases.
The main cost incurred after storage is that of searching the database.
File organizations are either primary or secondary (auxiliary).

File Organizations
Relations are usually stored in files as logical records and read in terms of physical blocks.
File organization refers to the way records are stored in terms of blocks, and the way blocks are placed on the storage medium and interlinked.
Types of organizations: unsorted, sorted, hashing.

Records
A record represents a tuple in a relation.
A file is a sequence of records.
Records can be either fixed-length or variable-length.
A record comprises a sequence of fields (columns, attributes).

Blocks
Blocks are the physical units of storage in storage devices (examples: sectors in hard disks, pages in virtual memory).
They are of fixed length, based on the physical characteristics of the storage/computing device and the operating system.
A storage device is either defragmented or fragmented, depending on whether contiguous sets of records lie in contiguous blocks.

Blocking Factor
The number of records that are stored in a block is called the blocking factor.
The blocking factor is constant across blocks if the record length is fixed, and variable otherwise.
If B is the block size and R is the record size, then the blocking factor is: bfr = ⌊B/R⌋
Since R may not exactly divide B, there can be some left-over space in each block, equal to: B - (bfr * R) bytes
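The blocking-factor arithmetic above can be checked with a small sketch. The function names and the sizes used (512-byte blocks, 100-byte records) are illustrative assumptions, not values from the lecture, and the blocking factor is taken as the whole number of records that fit in a block.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Whole number of fixed-length records that fit in one block (unspanned)."""
    if block_size <= 0 or record_size <= 0:
        raise ValueError("block and record sizes must be positive")
    return block_size // record_size


def unused_bytes(block_size: int, record_size: int) -> int:
    """Left-over space per block: B - (bfr * R)."""
    return block_size - blocking_factor(block_size, record_size) * record_size


# Hypothetical sizes, chosen only for illustration.
B, R = 512, 100
print(blocking_factor(B, R))  # 5
print(unused_bytes(B, R))     # 12
```

A blocking factor of 0 (record larger than block) signals that unspanned storage cannot hold the record at all.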

Spanned and Unspanned Records
When extra space in a block is left unused, the record organization is said to be unspanned.
[Figure: a block holding Record 1, Record 2 and Record 3, followed by unused space]

Spanned and Unspanned Records
In spanned record storage, records can be split so that they span across blocks.
[Figure: block m holds Record 1, Record 2, Record 3 and part of Record 4; block p holds the remaining part of Record 4]

Spanned and Unspanned Records
When the record size is greater than the block size (i.e. R > B), use of spanned record storage is compulsory.

Indexes
Index files: secondary or auxiliary files that help speed up data access in primary files.
Indexes or access structures: data structures (and search methods) used for fast access.
Single-level index: the index file maps directly to the block or the address of the record.
Multi-level index: multiple levels of indirection among indexes.

Definitions
Indexing field (indexing attribute): the field on which an index structure is built (searching is fast on this field).
Primary index: an index structure that is defined on the ordering key field (the field that is used to physically order records on disk in sorted file organizations).
Clustering index: when the ordering field is not a key field (i.e. not unique), a clustering index is used instead of a primary index.
Secondary index: an index structure defined on a non-ordering field.

Primary Indexes
A primary index consists of an ordered file of fixed-length records with two fields.
The first field is of the same data type as the ordering key (primary key), and the second field is a block address.
Primary index records are represented by a pair (k(i), a(i)), where k(i) is the key for the i-th record and a(i) is the block address containing the i-th record.
[Figure: primary index. Index file entries (k(i), a(i)) point to the blocks of a data file with fields RollNo, Name, Age, Gender, Grade.]

Primary Index
The number of entries in the index is equal to the number of disk blocks in the ordered data file.
In a sparse index, only the first record in each block of the file is indexed. These records are called anchor records.
A sparse index has index entries for only some of the search values.
A dense index has an index entry for every search key value (every record in the data file).
Dense indexes are not beneficial on ordered data files.

Primary Index
Search: easy. Perform a binary search on the index file to identify the block containing the required record.
Insertion / deletion: easy if key values in records are fixed length and statically allocated to blocks without block spanning (this results in wasted space, however). Otherwise, the index has to be recomputed on insertion / deletion, and overflow buffers may be necessary.
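The binary search over a sparse primary index described above can be sketched as follows. The index contents, the integer keys and the `find_block` name are assumptions made for illustration; the index is modelled as an in-memory list of (anchor key, block address) pairs rather than an index file on disk.

```python
from bisect import bisect_right


def find_block(index, key):
    """Return the address of the block whose anchor key is the largest one <= key.

    `index` is a list of (anchor_key, block_address) pairs sorted by anchor_key.
    Returns None if the key is smaller than the first anchor key.
    """
    anchor_keys = [anchor for anchor, _ in index]
    pos = bisect_right(anchor_keys, key) - 1  # binary search over the anchors
    if pos < 0:
        return None
    return index[pos][1]


# Hypothetical sparse index: one entry per data block.
sparse_index = [(100, 0), (140, 1), (180, 2)]
print(find_block(sparse_index, 150))  # 1
```

The search only identifies the candidate block. The block itself still has to be read and scanned to find out whether the record exists.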

Clustering Index
Clustering field: a non-key ordering field. That is, blocks are ordered on this field, which does not have the UNIQUE constraint.
The structure of the index file is similar to that of a primary index file, but each index entry points to the first block having the given value in its clustering field.
There is one index entry for every distinct value of the clustering field.
[Figure: clustering index. Index entries (K(I), A(I)) point to the blocks of a data file ordered on Dept No, with fields Dept No, Name, Gender, DOB, Job.]

Clustering Index
A clustering index is a sparse index, since only distinct values are indexed.
Insertion and deletion cause problems when a block can hold more than one value of the clustering field.
Alternative solution: allocate separate blocks for each value of the clustering field.
[Figure: clustering index with separate blocks allocated for each value of the clustering field Dept No]
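As a rough illustration of the clustering index structure (one entry per distinct value, pointing to the first block that contains it), the sketch below builds such an index from an ordered file. The block layout and the department numbers are made-up assumptions, and each block is reduced to the list of clustering-field values it holds.

```python
def build_clustering_index(blocks):
    """Build (value, first_block_number) entries for a file ordered on the clustering field.

    `blocks` is a list of blocks; each block is a list of clustering-field values.
    """
    first_block = {}
    for block_no, block in enumerate(blocks):
        for value in block:
            # Keep only the first block in which each distinct value appears.
            first_block.setdefault(value, block_no)
    return sorted(first_block.items())


# Hypothetical data file: three blocks ordered on a department number.
data_blocks = [[1, 1, 2], [2, 3, 3], [3, 3, 5]]
print(build_clustering_index(data_blocks))
# [(1, 0), (2, 0), (3, 1), (5, 2)]
```

Value 2 starts in block 0 and continues in block 1, which is the situation that makes insertion and deletion awkward when a block can hold more than one value.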

Secondary Index
A secondary index is used to index fields that are not ordering fields; the indexed field may be a key field or a non-key field.
Many secondary indexes are possible on a single file.
There is one index entry for every record in the data file (dense index), containing the value of the indexed attribute and a pointer to the block / record.

Secondary Index on Key Field
[Figure: secondary index on a key field. Ordered index entries (K(i), A(i)) point to records of an unordered data file with fields RollNo, Name, Age, Dept No, Job.]
The index has as many entries as there are records.

Secondary Index on Key Field
Since key fields are unique, the number of index entries is equal to the number of records.
The data file need not be sorted on disk.
The index file uses fixed-length records.

Secondary Index on Non-key Field
When a non-key field is indexed, duplicate values have to be handled.
There are three different techniques for handling duplicates: duplicate index entries, variable-length records, and extra redirection levels.

Duplicate Index Entries
[Figure: an index file in which the same K(i) value appears in several consecutive entries, each with its own A(i)]
Index entries are repeated for each duplicate occurrence of the non-key attribute.
Binary search becomes more complicated: the mid-point of a search may have duplicate entries on either side.
Insertion of records may need restructuring of the index table.

Variable Length Records
Use variable-length records for the index table in order to accommodate duplicate key entries.
For a given key K(i), there is a set of address pointers instead of a single address pointer.
Binary search becomes complicated, since address mid-points cannot be computed efficiently.
Insertion of records may need restructuring of the index table.

Extra Redirection Levels
[Figure: index entries (K(I), A(I)) on the field LabId point to address blocks, which in turn point to records of a data file with fields RollNo, Name, Age, LabId, Grade]

Extra Redirection Levels
This is the most frequently used technique.
Index records are of fixed length.
A(i) in an index record points to a block of address fields.
Block overflows are handled by chaining.
Retrieval requires a sequential search within blocks.
Insertion of records is straightforward.
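The extra level of indirection can be sketched with a small in-memory model. Everything here is an illustrative assumption: the class names, the block capacity of 2 and the use of a Python dict in place of a sorted, fixed-length index file. The point is only to show address blocks, overflow chaining and sequential retrieval.

```python
class AddressBlock:
    """A fixed-capacity block of record addresses, chained on overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.addresses = []
        self.next = None  # overflow chain


class IndirectSecondaryIndex:
    """Secondary index on a non-key field: key -> chain of address blocks."""

    def __init__(self, block_capacity=2):
        self.block_capacity = block_capacity
        self.entries = {}  # stands in for the fixed-length index records

    def insert(self, key, record_address):
        block = self.entries.get(key)
        if block is None:
            block = self.entries[key] = AddressBlock(self.block_capacity)
        # Walk the chain to the first block with free space, extending it if needed.
        while len(block.addresses) >= block.capacity:
            if block.next is None:
                block.next = AddressBlock(self.block_capacity)
            block = block.next
        block.addresses.append(record_address)

    def lookup(self, key):
        """Sequentially collect all record addresses stored for `key`."""
        found = []
        block = self.entries.get(key)
        while block is not None:
            found.extend(block.addresses)
            block = block.next
        return found


index = IndirectSecondaryIndex(block_capacity=2)
for address in ("r1", "r2", "r3"):
    index.insert(7, address)
print(index.lookup(7))  # ['r1', 'r2', 'r3']
```

Inserting a duplicate value never touches the index records themselves, which is why insertion stays straightforward with this technique.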

Multi-level Indexes
Binary search in a single-level index requires a search time of the order of log2 b block accesses, where b is the number of blocks in the index file.
If the bfr of the index file is greater than 2, the number of block accesses can be reduced even further.
Multi-level indexes are meant for such a reduction.

Multi-level Indexes
A multi-level index contains several levels of the index file.
Each index block at a given level connects to a maximum of fo blocks at the next level, where fo is called the fan-out of the index structure.
Block accesses are reduced from log2 b to log_fo b on average.
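To make the comparison between binary search and a multi-level index concrete, the sketch below counts block accesses for both. The function names and the figures used (1000 index blocks, fan-out 100) are assumptions for illustration. The multi-level count assumes one block read per level, with levels added until the top level fits in a single block.

```python
def binary_search_accesses(b: int) -> int:
    """Approximate block accesses for binary search over b index blocks: ceil(log2 b)."""
    if b < 1:
        raise ValueError("b must be at least 1")
    return max(1, (b - 1).bit_length())


def index_levels(b: int, fo: int) -> int:
    """Levels needed when each index block points to at most fo blocks of the level below.

    b is the number of blocks in the first (base) level of the index.
    """
    if b < 1 or fo < 2:
        raise ValueError("need b >= 1 and fo >= 2")
    levels = 1
    while b > 1:
        b = -(-b // fo)  # ceiling division: blocks needed at the next level up
        levels += 1
    return levels


# Hypothetical index: 1000 first-level blocks, fan-out 100.
print(binary_search_accesses(1000))  # 10
print(index_levels(1000, 100))       # 3
```

With these numbers, the multi-level index reads one block per level (3 in total) instead of about 10 for a binary search of the single-level index.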

A Two-level Index Structure
[Figure: a two-level index structure, showing the first (base) level and the second (top) level]

Two-level Index Structure
The first (base) level is the usual primary index that is maintained in a sorted file.
The second (top) level is a primary index into the first-level index file.
The process can be repeated to any number of levels.
Each level reduces the number of entries of its next level by a factor of fo.

Summary: Types of Indexes

                      Key field               Non-key field
Ordering field        Primary index           Clustering index
Non-ordering field    Secondary index (key)   Secondary index (non-key)

Summary: Properties of Indexes

Type of index         Number of (first-level) index entries                    Dense or non-dense
Primary               Number of blocks in data file                            Non-dense
Clustering            Number of distinct index field values                    Non-dense
Secondary (key)       Number of records in data file                           Dense
Secondary (non-key)   Number of records or number of distinct field values     Dense or non-dense

Summary
Multi-level indexes: several levels of index files.
Characteristic fan-out property: the fan-out fo is preferably greater than 2.
A multi-level index reduces the number of block accesses to the order of log_fo b.

Dynamic Multi-level Indexes

Balanced and Unbalanced Index Trees
[Figure: an unbalanced index tree, with O(n) search, beside a balanced index tree, with Θ(log_fo n) search]

Insertions and Deletions
The balanced property of index trees should be maintained during insertions and deletions.
Insertions and deletions are problematic in a multi-level index, since all index files are physically sorted files.
An approach to overcome this is to use dynamic multi-level indexes.

B-Trees
A B-tree is a tree data structure where each node has a predetermined maximum fan-out p.
Terminology: root node, leaf nodes, internal nodes, parent, children.

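Structure of a Node
[Figure: a B-tree node with keys K_1, K_2, ..., K_{i-1}, K_i, each with a data pointer. The left-most subtree holds values X < K_1, a subtree between two adjacent keys holds the values between them, and the right-most subtree holds values X greater than the last key.]

B-Tree Constraints
For a node containing p-1 keys (and p subtrees), the following condition must always hold: K_1 < K_2 < ... < K_{p-1}
For any data element X in subtree P_i, it should always be the case that K_{i-1} < X < K_i, with X < K_1 in the left-most subtree and X > K_{p-1} in the right-most subtree.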

B-Tree Constraints
Each node has at most p tree pointers.
Each node, except the root and leaf nodes, has at least ⌈p/2⌉ tree pointers (tree balancing constraint).
The root node has at least 2 tree pointers, unless it is the only node in the tree.
All leaf nodes are at the same level.
In a leaf node, all tree pointers are null.

B+ Trees
B+ trees are the most common index structures in RDBMS.
Leaf and non-leaf nodes have different structures: data pointers are stored only at the leaf nodes.
Leaf nodes form a dense index containing every entry for the search field and its corresponding record pointer.
Leaf nodes are linked to provide ordered access to data file records.

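Non-leaf Nodes in B+ Trees
[Figure: a non-leaf node with keys K_1, K_2, ..., K_{i-1}, K_i and tree pointers only. The left-most subtree holds values X < K_1, a subtree between two adjacent keys holds the values between them, and the right-most subtree holds values X greater than the last key.]

Leaf Nodes in B+ Trees
[Figure: a leaf node with keys K_1, K_2, ..., K_{i-1}, K_i, a data pointer for each key, and a pointer to the next leaf node in the tree]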

Properties of Leaf Nodes
Keys along the chain of leaf nodes are organized in sorted order: K_1 < K_2 < ... < K_n
Each leaf node has at least ⌈p/2⌉ values.
All leaf nodes are at the same level.

Searching in B+ Trees
Searching is a generalization of binary search.
1. Given a search key k, start from the root node.
2. If the key is present in the current node, then success; else
3. If the current node is a leaf node and the key is not present in the node, then the key is not in the database.
4. Search for a tree pointer P_i such that K_{i-1} < k ≤ K_i.
5. Follow P_i and return to step 2 to continue the search.
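The search steps above can be sketched in a few lines. This is a minimal in-memory model, not a disk-based implementation: the `Node` class, its field names and the sample tree are assumptions. Because data pointers are stored only at the leaves of a B+ tree, the sketch always descends to a leaf before reporting success, choosing at each non-leaf node the pointer P_i with K_{i-1} < k ≤ K_i.

```python
from bisect import bisect_left


class Node:
    """A B+ tree node: non-leaf nodes have children, leaf nodes have data pointers."""

    def __init__(self, keys, children=None, pointers=None):
        self.keys = keys          # sorted keys
        self.children = children  # non-leaf: len(keys) + 1 child nodes
        self.pointers = pointers  # leaf: one data pointer per key

    @property
    def is_leaf(self):
        return self.children is None


def search(root, k):
    """Return the data pointer stored for key k, or None if k is absent."""
    node = root
    while not node.is_leaf:
        # bisect_left gives the first i with keys[i] >= k, i.e. K_{i-1} < k <= K_i.
        node = node.children[bisect_left(node.keys, k)]
    i = bisect_left(node.keys, k)
    if i < len(node.keys) and node.keys[i] == k:
        return node.pointers[i]
    return None


# Hypothetical two-level tree: root key 5, leaves (3, 5) and (7, 8).
left = Node([3, 5], pointers=["rec3", "rec5"])
right = Node([7, 8], pointers=["rec7", "rec8"])
root = Node([5], children=[left, right])
print(search(root, 7))  # rec7
print(search(root, 6))  # None
```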

Insertion
Originally, the tree begins with only the root node.
As and when nodes fill up, they are split and made children of a new node.
Keys are split uniformly across the three nodes.

Insertion
Let p = [value not recoverable].
Let the insertion sequence of keys begin with 5, 8, 3, 7; the later keys, which include 9, are garbled in this transcript.
[Figure: tree after insertion of 5 and 8: a single node holding (5, 8)]
Insertion of the next key, 3, causes an overflow requiring a split.
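The first split in this example can be reproduced with a short sketch of a leaf-level insertion. The leaf capacity of 2 is an assumption inferred from the example (the third key overflows a node holding 5 and 8); the function name and the choice of copying up the largest key of the left half as separator are also illustrative assumptions. Cascading the split into parent nodes is not shown.

```python
from bisect import insort


def insert_into_leaf(leaf, key, capacity):
    """Insert key into a sorted leaf (a list); split the leaf if it overflows.

    Returns (left, right, separator). When no split is needed,
    right and separator are None.
    """
    insort(leaf, key)  # keep the leaf sorted
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = (len(leaf) + 1) // 2  # left half keeps the extra key
    left, right = leaf[:mid], leaf[mid:]
    # The largest key of the left leaf is copied up as the separator.
    return left, right, left[-1]


leaf = [5, 8]  # tree after insertion of 5 and 8
print(insert_into_leaf(leaf, 3, capacity=2))  # ([3, 5], [8], 5)
```

The result matches the tree shown after the split: a new root holding 5, with leaves (3, 5) and (8).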

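Insertion
[Figure: tree after insertion of 3. Root: 5; leaves: (3, 5) and (8).]
7 is inserted into the leaf holding 8. No overflow.

Insertion
[Figure: tree after insertion of 7. Root: 5; leaves: (3, 5) and (7, 8).]
Insertion of the next key causes overflows that need to be cascaded to upper levels.

Insertion
[Figure: tree after the cascaded split, followed by the tree after insertion of 9]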

Deletion
Deletion of keys may cause underflows, which have to be handled separately.
An underflow occurs when a node contains fewer than ⌈p/2⌉ keys.
Nodes are merged with their siblings when underflows occur.

Indexes on Multiple Attributes
All index structures explored so far assume simple attributes, comprising only one value.
Many applications require multi-attribute (composite) keys.

Ordered Index on Multiple Attributes
Considers a composite key as a tuple of simple keys (k_1, k_2, ..., k_n).
Ordered index files are maintained by ordering each key in sequence.

Partitioned Hashing
Given a composite key (k_1, k_2, ..., k_n), partitioned hashing returns n different bucket numbers.
The hash bucket is determined by concatenating the n numbers.
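A rough sketch of partitioned hashing follows. The per-component hash (CRC-32 reduced to a fixed number of bits), the bit widths and the sample keys are all assumptions made for illustration; the lecture does not specify a hash function. Each component is hashed separately and the resulting numbers are concatenated bitwise into one bucket address.

```python
import zlib


def partitioned_hash(composite_key, bits_per_component):
    """Hash each key component on its own and concatenate the results bitwise."""
    if len(composite_key) != len(bits_per_component):
        raise ValueError("one bit width is required per key component")
    bucket = 0
    for component, bits in zip(composite_key, bits_per_component):
        # Assumed component hash: CRC-32 of the text form, reduced to `bits` bits.
        part = zlib.crc32(str(component).encode("utf-8")) % (1 << bits)
        bucket = (bucket << bits) | part
    return bucket


# Hypothetical composite key (grade, roll number): 2 bits + 3 bits = 32 buckets.
a = partitioned_hash(("A", 17), (2, 3))
b = partitioned_hash(("A", 99), (2, 3))
print(a >> 3 == b >> 3)  # True: same grade, same leading bits
```

Because each component owns its own bits of the address, a query that fixes only one component still narrows the search to a subset of the buckets.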

Grid Files
Partitions the range of key values for each key into several buckets.
Combinations of the buckets of each key form a grid.
A grid file stores a grid in either row-major or column-major form.

Grid Files
[Figure: a grid with Grade buckets A, B, C, D on one axis and Roll No buckets 1 to 5 on the other; the grid cells point into a bucket pool]

Roll No bucket   Range
1                001 - 025
2                026 - 050
3                051 - 075
4                076 - 100
5                101 - 125
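The grid lookup can be sketched as below. The roll-number bucket boundaries (25, 50, 75, 100, 125) and the grade buckets A to D are assumptions read off the partly garbled figure, and the function names are hypothetical. The sketch maps a (roll number, grade) pair to a grid cell and then to its offset in a row-major layout.

```python
from bisect import bisect_left

# Assumed bucket boundaries: upper bound of each roll-number bucket.
ROLL_UPPER_BOUNDS = [25, 50, 75, 100, 125]
GRADES = ["A", "B", "C", "D"]


def grid_cell(roll_no, grade):
    """Return (row, column) of the grid cell for a (roll number, grade) pair."""
    row = bisect_left(ROLL_UPPER_BOUNDS, roll_no)  # first bucket whose bound >= roll_no
    if roll_no < 1 or row == len(ROLL_UPPER_BOUNDS):
        raise ValueError("roll number outside the partitioned range")
    return row, GRADES.index(grade)


def cell_offset(row, col):
    """Position of a cell when the grid is stored in row-major form."""
    return row * len(GRADES) + col


row, col = grid_cell(30, "C")
print(row, col)               # 1 2
print(cell_offset(row, col))  # 6
```

Storing the grid in column-major form would only change `cell_offset` to `col * len(ROLL_UPPER_BOUNDS) + row`.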

Summary
Multi-level indexes.
Trees: root node, leaf nodes, non-leaf (internal) nodes.
Dynamic multi-level indexes: B-trees and B+ trees.
Insertion and deletion in B+ trees.
Indexes on multiple attributes.