Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

Similar documents
CS232A: Database System Principles INDEXING. Indexing. Indexing. Given condition on attribute find qualified records Attr = value

Data Storage and Query Answering. Indexing and Hashing (5)

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Chapter 13: Indexing. Chapter 13. ? value. Topics. Indexing & Hashing. value. Conventional indexes B-trees Hashing schemes (self-study) record

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing

CS 525: Advanced Database Organization 04: Indexing

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

CSE 562 Database Systems

CS 245: Database System Principles

Chapter 11: Indexing and Hashing

key h(key) Hash Indexing Friday, April 09, 2004 Disadvantages of Sequential File Organization Must use an index and/or binary search to locate data

Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Chapter 11: Indexing and Hashing

CS 525: Advanced Database Organization

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

Remember. 376a. Database Design. Also. B + tree reminders. Algorithms for B + trees. Remember

Multidimensional Indexes [14]

Indexing and Hashing

COMP 430 Intro. to Database Systems. Indexing

CSIT5300: Advanced Database Systems

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing

Database index structures

Intro to DB CHAPTER 12 INDEXING & HASHING

Indexing: Overview & Hashing. CS 377: Database Systems

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

CMSC 424 Database design Lecture 13 Storage: Files. Mihai Pop

CS34800 Information Systems

Chapter 12: Indexing and Hashing (Cnt(

Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages!

Data Organization B trees

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Kathleen Durant PhD Northeastern University CS Indexes

Tree-Structured Indexes

Physical Level of Databases: B+-Trees

Material You Need to Know

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

Hashed-Based Indexing

Chapter 17 Indexing Structures for Files and Physical Database Design

Indexing Methods. Lecture 9. Storage Requirements of Databases

amiri advanced databases '05

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Lecture 8 Index (B+-Tree and Hash)

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Hash-Based Indexes. Chapter 11

Indexing. Announcements. Basics. CPS 116 Introduction to Database Systems

Hashing file organization

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

Hashing Techniques. Material based on slides by George Bebis

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Some Practice Problems on Hardware, File Organization and Indexing

CSC 261/461 Database Systems Lecture 17. Fall 2017

Tree-Structured Indexes

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Goals for Today. CS 133: Databases. Example: Indexes. I/O Operation Cost. Reason about tradeoffs between clustered vs. unclustered tree indexes

Physical Database Design: Outline

QUIZ: Buffer replacement policies

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Storage hierarchy. Textbook: chapters 11, 12, and 13

Outline. Database Management and Tuning. Index Data Structures. Outline. Index Tuning. Johann Gamper. Unit 5

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.


Chapter 5: Physical Database Design. Designing Physical Files

Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Database Applications (15-415)

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Chapter 18 Indexing Structures for Files. Indexes as Access Paths

Indexing and Hashing

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Database Technology. Topic 7: Data Structures for Databases. Olaf Hartig.

Database Management and Tuning

Outline. Database Management and Tuning. What is an Index? Key of an Index. Index Tuning. Johann Gamper. Unit 4

The physical database. Contents - physical database design DATABASE DESIGN I - 1DL300. Introduction to Physical Database Design

Hash-Based Indexing 1

CS 350 Algorithms and Complexity

Midterm Review. March 27, 2017

File Structures and Indexing

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

2, 3, 5, 7, 11, 17, 19, 23, 29, 31

(i) It is efficient technique for small and medium sized data file. (ii) Searching is comparatively fast and efficient.

Tree-Structured Indexes. Chapter 10

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

TUTORIAL ON INDEXING PART 2: HASH-BASED INDEXING

Introduction to Data Management. Lecture 15 (More About Indexing)

Lecture 13. Lecture 13: B+ Tree

Data Management for Data Science

Instructor: Amol Deshpande

RAID in Practice, Overview of Indexing

System Structure Revisited

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Hash Table and Hashing

CSE 544 Principles of Database Management Systems

Transcription:

Access Methods This is a modified version of Prof. Hector Garcia Molina s slides. All copy rights belong to the original author. Basic Concepts search key pointer Value record? value Search Key - set of attributes used to look up records in a file. Index Evaluation Metrics Access types supported efficiently. E.g., Point query: find Tom Range query: find students whose age is between - Access time Update time Space overhead 1

Ordered Indices In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library. same order Search key Primary index Also called clustering index The search key of a primary index is usually but not necessarily the primary key. 70 90 1 1 1 170 190 2 2 60 70 80 90 0 Secondary index: non-clustering index. 60 70... different order 70 80 0 90 60 Search key 2

Dense Index: contains index records for every search-key values. Dense Index 60 70 80 90 0 1 1 Sequential File 60 70 80 90 0 Sparse Index: contains index records for only some searchkey values. Applicable when records are sequentially ordered on search-key Sparse Index 70 90 1 1 1 170 190 2 2 Sequential File 60 70 80 90 0 Secondary indexes Sparse index 80 0 90... does not make sense! 70 80 0 90 60 Sequence field 3

Multilevel Index Sparse 2nd level 90 170 2 3 4 490 570 70 90 1 1 1 170 190 2 2 Sequential File 60 70 80 90 0 Multilevel Index Secondary indexes 90... sparse high level 60 70... Lowest level is dense Other levels are sparse 70 80 0 90 60 Sequence field Conventional indexes Advantage: -Simple - Index is sequential file good for scans Disadvantage: - Inserts expensive 4

Outline: Conventional indexes B+-Tree NEXT NEXT: Another type of index Give up on sequentiality of index Try to get balance B+Tree Example n=4 Root 3 5 11 35 0 1 1 1 1 1 156 179 180 0 1 1 180 0 5

Sample non-leaf 57 81 95 to keys to keys to keys to keys < 57 57 k<81 81 k<95 95 Key is moved (not copied) from lower level non-leaf node to upper level non-leaf node Sample leaf node: From non-leaf node to next leaf in sequence To record with key 57 To record with key 81 To record with key 85 57 81 95 Key is copied (not moved) from leaf node to non-leaf node n=4 Leaf: 35 35 Non-leaf: 6

Size of nodes: n pointers n-1 keys Don t want nodes to be too empty Use at least Root : 2 pointers Non-leaf: n/2 pointers Leaf : (n-1)/2 keys n=4 Full node min. node Non-leaf 1 1 180 Leaf 3 5 11 35 counts even if null 7

B+tree rules tree of order n (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for sequence pointer (3) Number of pointers/keys for B+tree Max Max Min ptrs keys ptrs data Min keys Non-leaf (non-root) n n-1 n/2 n/2-1 Leaf (non-root) n n-1 (n-1)/2 (n-1)/2 Root n n-1 2 1 Insert into B+tree (a) simple case space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root 8

(a) Insert key = 32 n=4 3 5 11 31 32 0 (b) Insert key = 7 n=4 3 5 3 5 11 7 31 7 0 (c) Insert key = 160 n=4 1 156 179 160 179 180 0 1 1 180 180 0 160 9

(d) New root, insert 45 n=4 new root 1 2 3 12 25 32 45 Deletion from B+tree (a) Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf (b) Coalesce with sibling Delete n=5 0

(c) Redistribute keys Delete n=5 35 35 0 35 (d) Non-leaf coalesce Delete 37 n=5 new root 1 3 25 14 22 25 26 37 45 25 B+tree deletions in practice Often, coalescing is not implemented Too hard and not worth it! 11

Index Definition in SQL Create an index create index <index-name> on <relation-name> (<attribute-list>) E.g.: create index gindex on country(gdp); To drop an index drop index <index-name> E.g.: drop index gindex; Multi-key Index Motivation: Find records where DEPT = Toy AND SAL > k Strategy I: Use one index, say Dept. Get all Dept = Toy records and check their salary I1 12

Strategy II: Use 2 Indexes; Manipulate Pointers Toy Sal > k Strategy III: Multiple Key Index One idea: I2 I1 I3 Example Art Sales Toy Dept Index k 15k 17k 21k 12k 15k 15k 19k Salary Index Example Record Name=Joe DEPT=Sales SAL=15k 13

For which queries is this index good? Find RECs Dept = Sales Find RECs Dept = Sales Find RECs Dept = Sales Find RECs SAL = k SAL=k SAL > k Interesting application: Geographic Data y DATA: <X1,Y1, Attributes> <X2,Y2, Attributes> x... Queries: What city is at <Xi,Yi>? What is within 5 miles from <Xi,Yi>? Which is closest point to <Xi,Yi>? 14

Example 25 15 35 l j k h i n o m e f g d b c a 5 h i g f d e c a b 15 15 j k l m n o Search points near f Search points near b Queries Find points with Yi > Find points with Xi < 5 Find points close to i = <12,38> Find points close to b = <7,24> Many types of geographic index structures have been suggested Quad Trees R Trees 15

Two more types of multi key indexes Grid Bitmap index Grid Index Key 2 Key 1 V1 V2 X1 X2 Xn Vn To records with key1=v3, key2=x2 CLAIM Can quickly find records with key 1 = V i Key 2 = X j key 1 = V i key 2 = X j And also ranges. E.g., key 1 V i key 2 < X j 16

But there is a catch with Grid Indexes! How is Grid Index stored on disk? Like Array... V1 V2 V3 X1 X2 X3 X4 X1 X2 X3 X4 X1 X2 X3 X4 Problem: Need regularity so we can compute position of <Vi,Xj> entry Solution: Use Indirection V1 V2 V3 V4 Buckets X1 X2 X3 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Buckets *Grid only contains pointers to buckets With indirection: Grid can be regular without wasting space We do have price of indirection 17

Can also index grid on value ranges Salary Grid 0-K 1 K-K 2 K- 3 8 Linear Scale 1 2 3 Toy Sales Personnel Grid files + Good for multiple-key search - Space, management overhead (nothing is free) - Need partitioning ranges that evenly split keys Example Grid File for account Divide branch-name into non-uniform intervals? two attributes as search key Branch-name <Central and k<=balance<k Divide balance into nonuniform intervals What about Central<=branch-name<Townsend and k<=balance? 18

Example Grid File for account Bj Bk Grid Files (Cont.) Linear scales must be chosen to uniformly distribute records across cells. Otherwise there will be too many overflow buckets. Periodic re-organization to increase grid size will help. But reorganization can be very expensive. Space overhead of grid array can be high. Bitmap Indices Another index could be used for multiple valued search keys 19

Bitmap Indices (Cont.) The income-level value of record 3 is L1 Bitmap(size = table size) Unique values of gender Unique values of income-level Bitmap Indices (Cont.) Some properties of bitmap indices Number of bitmaps for each attribute? Size of each bitmap? When is the bitmap matrix sparse and what attributes are good for bitmap indices? Bitmap Indices (Cont.) Bitmap indices generally very small compared with relation size E.g. if record is 0 bytes, space for a single bitmap is 1/800 of space used by relation. If number of distinct attribute values is 8, bitmap is only 1% of relation size What about insertion? Deletion?

Bitmap Indices Queries Sample query: Males with income level L1 0 AND 0 = 000 even faster! What about the number of males with income level L1? Bitmap Indices Queries Queries are answered using bitmap operations Intersection (and) Union (or) Complementation (not) Hashing key h(key) <key>. Buckets (typically 1 disk block) 21

Two alternatives. (1) key h(key) records. Two alternatives (2) key h(key) key 1 record Index Alt (2) for secondary search key Example hash function Key = x1 x2 xn n byte character string Have b buckets h: add x1 + x2 +.. xn compute sum modulo b 22

This may not be best function Good hash function: Expected number of keys/bucket is the same for all buckets Within a bucket: Do we keep keys sorted? Yes, if CPU time critical & Inserts/Deletes not too frequent Next: example to illustrate inserts, overflows, deletes h(k) 23

EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 0 1 2 d a c b e 3 h(e) = 1 EXAMPLE: deletion Delete: e f c 0 1 2 a b c e d d 3 f g maybe move g up Rule of thumb: Try to keep space utilization between % and 80% Utilization = # keys used total # keys that fit If < %, wasting space If > 80%, overflows significant depends on how good hash function is & on # keys/bucket 24

How do we cope with growth? Overflows and reorganizations Dynamic hashing Extensible Linear Extensible hashing: two ideas (a) Use i of b bits output by hash function b h(k) 0011 use i grows over time. (b) Use directory h(k)[i ]. to bucket. 25

Example: h(k) is 4 bits; 2 keys/bucket i = 1 1 0001 i = 2 00 01 1 2 01 10 11 Insert 1 2 10 New directory Example continued i = 2 00 01 11 Insert: 0111 0000 2 0000 0001 1 2 0001 0111 0111 2 01 2 10 Example continued i = 2 00 01 0000 0001 0111 2 2 i = 3 000 001 0 11 Insert: 01 01 01 01 10 3 2 3 2 011 0 1 1 111 26

Extensible hashing: deletion No merging of blocks Merge blocks and cut directory if possible (Reverse insert procedure) Deletion example: Run thru insert example in reverse! Summary + Extensible hashing Can handle growing files - with less wasted space - with no full reorganizations - - Indirection (Not bad if directory in memory) Directory doubles in size (Now it fits, now it does not) 27

Linear hashing Another dynamic hashing scheme Two ideas: (a) Use i low order bits of hash b 0111 grows i (b) File grows linearly Example b=4 bits, i =2, 2 keys/bucket 01 insert 01 can have overflow chains! 0000 01 1111 00 01 11 m = 01 (max used block) Future growth buckets Rule If h(k)[i ] m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ] -2 i -1 Example b=4 bits, i =2, 2 keys/bucket 01 insert 01 0000 01 01 1111 1111 00 01 11 m = 01 (max used block) 11 Future growth buckets 28

Example Continued: How to grow beyond this? i = 2 3 0000 01 01 1111 000 0 0 11 0 0 1 1 111 m = 11 (max used block) 0 1 0 01 01 1... When do we expand file? Keep track of: # used slots total # of slots = U If U > threshold then increase m (and maybe i ) + Summary Linear Hashing Can handle growing files - with less wasted space - with no full reorganizations + No indirection like extensible hashing - Can still have overflow chains 29

Example: BAD CASE Very full Very empty Need to move m here Would waste space... Summary Hashing -How it works - Dynamic hashing -Extensible - Linear Indexing vs Hashing Hashing good for probes given key e.g., SELECT FROM R WHERE R.A = 5

Indexing vs Hashing INDEXING good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 31