
Access Methods

This is a modified version of Prof. Hector Garcia-Molina's slides. All copyrights belong to the original author.

Basic Concepts
- An index entry pairs a search-key value with a pointer to the record that holds that value.
- Search key: the set of attributes used to look up records in a file.

Index Evaluation Metrics
- Access types supported efficiently, e.g.:
  - Point query: find Tom
  - Range query: find students whose age falls in a given range
- Access time
- Update time
- Space overhead

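Ordered Indices
In an ordered index, index entries are stored sorted on the search-key value, e.g., the author catalog in a library.

Primary index
- Also called clustering index
- The index and the file are sorted in the same order on the search key
- The search key of a primary index is usually, but not necessarily, the primary key
[Figure: primary index whose entries follow the same order as the file]

Secondary index
- Also called non-clustering index
- The index is sorted on a search key whose order differs from the order of the file
[Figure: secondary index over a file stored in a different order]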

Dense Index: contains an index record for every search-key value.
[Figure: dense index over a sequential file]

Sparse Index: contains index records for only some search-key values. Applicable when records are sequentially ordered on the search key.
[Figure: sparse index pointing into a sequential file]

Secondary indexes
A sparse secondary index does not make sense! The file is not ordered on the secondary search key, so a key without an index entry cannot be found by scanning forward from a neighboring entry.
[Figure: sparse index over a file sorted on a different sequence field]
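To make the dense/sparse distinction concrete, here is a minimal sketch. The file contents, the block size of two records, and the names `blocks`, `dense`, `sparse_keys`, and `sparse_lookup` are illustrative assumptions, not part of the slides.

```python
from bisect import bisect_right

# Hypothetical sequential file: blocks of records sorted on the search key.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]

# Dense index: one entry per search-key value -> (block number, slot).
dense = {key: (b, s) for b, blk in enumerate(blocks) for s, key in enumerate(blk)}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [blk[0] for blk in blocks]


def sparse_lookup(key):
    """Return (block, slot) for key, or None if the key is absent."""
    b = bisect_right(sparse_keys, key) - 1   # last entry whose first key <= key
    if b < 0:
        return None
    for s, k in enumerate(blocks[b]):        # scan inside the chosen block
        if k == key:
            return (b, s)
    return None
```

The sparse lookup only works because the file is sorted on the search key: the index narrows the search to one block, and the scan inside that block does the rest.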

Multilevel Index
A sparse second-level index is built on top of the first-level index, which in turn points into the sequential file.
[Figure: sparse second level over a first-level index over the sequential file]

Multilevel Index on Secondary indexes
- Lowest level is dense
- Other levels are sparse
[Figure: sparse high level over a dense lowest level; the file is ordered on a different sequence field]

Conventional indexes
Advantage:
- Simple
- Index is a sequential file, good for scans
Disadvantage:
- Inserts expensive
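A two-level lookup of the kind described for multilevel indexes might look like the sketch below. The key values, the index-block size of four entries, and all names are assumptions chosen for illustration.

```python
from bisect import bisect_right

# Hypothetical layout: a dense first-level index split into index blocks,
# and a sparse second level holding the first key of each index block.
keys = list(range(10, 130, 10))                       # 10, 20, ..., 120
index_blocks = [keys[i:i + 4] for i in range(0, len(keys), 4)]
top_level = [blk[0] for blk in index_blocks]


def find_index_entry(key):
    """Return (index block number, slot) of key in the dense level, or None."""
    b = bisect_right(top_level, key) - 1              # choose the index block via the sparse level
    if b < 0:
        return None
    blk = index_blocks[b]
    s = bisect_right(blk, key) - 1                    # binary search inside that block
    return (b, s) if s >= 0 and blk[s] == key else None
```

Only one index block of the dense level is touched per lookup, which is the point of adding the sparse level on top.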

Outline
- Conventional indexes
- B+-tree (NEXT)

NEXT: Another type of index
- Give up on sequentiality of index
- Try to get balance

B+-tree Example (n=4)
[Figure: example B+-tree showing the root and the leaf level]

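Sample non-leaf node with keys 57, 81, 95 (four pointers, left to right):
- to keys k < 57
- to keys 57 ≤ k < 81
- to keys 81 ≤ k < 95
- to keys k ≥ 95
When a non-leaf node splits, a key is moved (not copied) from the lower-level non-leaf node to the upper-level non-leaf node.

Sample leaf node:
- reached from a non-leaf node
- pointers to the record with key 57, to the record with key 81, and to the record with key 85
- sequence pointer to the next leaf in sequence
When a leaf splits, a key is copied (not moved) from the leaf node to the non-leaf node.

[Figure: leaf and non-leaf node layouts for n=4]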
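The routing rule of the sample non-leaf node can be expressed with a single binary search. This is a sketch; `child_index` and `node_keys` are names introduced here for illustration.

```python
from bisect import bisect_right


def child_index(keys, k):
    """Index of the child pointer to follow for search key k in a non-leaf node."""
    # Pointer 0 covers k < keys[0]; pointer j covers keys[j-1] <= k < keys[j];
    # the last pointer covers k >= keys[-1].
    return bisect_right(keys, k)


node_keys = [57, 81, 95]
```

For example, `child_index(node_keys, 81)` selects pointer 2, the one that covers 81 ≤ k < 95.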

Size of nodes:
- n pointers
- n-1 keys

Don't want nodes to be too empty. Use at least:
- Root: 2 pointers
- Non-leaf: ⌈n/2⌉ pointers
- Leaf: ⌈(n-1)/2⌉ keys

[Figure: full node and minimum node for n=4, non-leaf and leaf; a pointer counts even if null]

B+tree rules (tree of order n)
(1) All leaves at same lowest level (balanced tree)
(2) Pointers in leaves point to records, except for the sequence pointer
(3) Number of pointers/keys for B+tree:

                      Max ptrs   Max keys   Min ptrs      Min keys
Non-leaf (non-root)   n          n-1        ⌈n/2⌉         ⌈n/2⌉-1
Leaf (non-root)       n          n-1        ⌈(n-1)/2⌉     ⌈(n-1)/2⌉
Root                  n          n-1        2             1

For leaves, the minimum counts pointers to data.

Insert into B+tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
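The occupancy limits in rule (3) can be tabulated for any order n. The sketch below assumes the fractional minimums round up (ceiling), which is the usual convention; the function name `occupancy` is hypothetical.

```python
from math import ceil


def occupancy(n):
    """Pointer/key limits for a B+-tree of order n (n pointers per node)."""
    return {
        "non-leaf": {"max_ptrs": n, "max_keys": n - 1,
                     "min_ptrs": ceil(n / 2), "min_keys": ceil(n / 2) - 1},
        # For leaves, min_ptrs counts pointers to data records.
        "leaf": {"max_ptrs": n, "max_keys": n - 1,
                 "min_ptrs": ceil((n - 1) / 2), "min_keys": ceil((n - 1) / 2)},
        "root": {"max_ptrs": n, "max_keys": n - 1, "min_ptrs": 2, "min_keys": 1},
    }
```

With n=4 this gives a non-leaf minimum of 2 pointers and a leaf minimum of 2 keys.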

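(a) Insert key = 32 (n=4): space available in leaf
[Figure: the key is added to a leaf that has a free slot]

(b) Insert key = 7 (n=4): leaf overflow
[Figure: the leaf holding 3, 5, 11 overflows and splits; 7 is copied into the parent]

(c) Insert key = 160 (n=4): non-leaf overflow
[Figure: the leaf split propagates upward and a full non-leaf node splits as well]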
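Insert cases (a) and (b) can be sketched for a single leaf as follows. The split point (left node keeps the larger half) and the function name `insert_into_leaf` are assumptions; the slides do not fix them.

```python
from bisect import insort


def insert_into_leaf(leaf, key, n):
    """Insert key into a sorted leaf; split when it would exceed n - 1 keys.

    Returns (left, right, separator); right and separator are None without a split.
    """
    keys = list(leaf)
    insort(keys, key)
    if len(keys) <= n - 1:                 # case (a): space available in leaf
        return keys, None, None
    mid = (len(keys) + 1) // 2             # case (b): leaf overflow, split in two
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]           # separator is copied up, not moved
```

Calling `insert_into_leaf([3, 5, 11], 7, 4)` splits the leaf into `[3, 5]` and `[7, 11]` and hands 7 to the parent, while the key 7 also stays in the right leaf.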

(d) New root, insert 45 (n=4)
[Figure: the split reaches the root, and a new root is created]

Deletion from B+tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf

(b) Coalesce with sibling (n=5)
[Figure: after the deletion the leaf is too empty and is merged with its sibling]

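(c) Redistribute keys (n=5)
[Figure: after the deletion the leaf is too empty and takes a key from its sibling]

(d) Non-leaf coalesce: delete 37 (n=5)
[Figure: merging leaves leaves a non-leaf node too empty; the non-leaf nodes are merged and a new root results]

B+tree deletions in practice
- Often, coalescing is not implemented
- Too hard and not worth it!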

Index Definition in SQL
Create an index:
    create index <index-name> on <relation-name> (<attribute-list>)
E.g.: create index gindex on country(gdp);
To drop an index:
    drop index <index-name>
E.g.: drop index gindex;

Multi-key Index
Motivation: find records where DEPT = "Toy" AND SAL > k

Strategy I: Use one index, say Dept. Get all Dept = "Toy" records and check their salary.

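Strategy II: Use 2 indexes; manipulate pointers. Fetch the pointers for Dept = "Toy" from one index and the pointers for Sal > k from the other, then intersect the two pointer sets.

Strategy III: Multiple-key index
One idea: a first-level index I1 on one attribute whose entries point to second-level indexes (I2, I3, ...) on the other attribute.

Example
[Figure: Dept index with entries Art, Sales, Toy; each entry points to a salary index with salaries such as 12k, 15k, 17k, 19k, 21k]
Example record: Name=Joe, DEPT=Sales, SAL=15k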
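Strategy III can be sketched as a dictionary of sorted lists: the dictionary plays the role of the Dept index and each list the role of a salary index. The records (other than Joe's) and all function names are made up for illustration.

```python
from bisect import bisect_right, insort

# Hypothetical records: (name, dept, salary in thousands).
records = [("Joe", "Sales", 15), ("Ann", "Toy", 12), ("Bob", "Toy", 21),
           ("Eve", "Art", 17), ("Sam", "Sales", 19)]

# First level: dept -> second-level index; second level: sorted (salary, name).
index = {}
for name, dept, sal in records:
    insort(index.setdefault(dept, []), (sal, name))


def dept_and_salary_above(dept, k):
    """Names with DEPT = dept AND SAL > k, using both index levels."""
    entries = index.get(dept, [])
    # (k, highest code point) sorts after every (k, name), so this skips SAL <= k.
    start = bisect_right(entries, (k, chr(0x10FFFF)))
    return [name for _, name in entries[start:]]


def salary_equals(k):
    """SAL = k alone: every second-level index has to be visited."""
    return sorted(name for entries in index.values()
                  for sal, name in entries if sal == k)
```

The contrast between the two functions hints at the trade-off: a condition on the first attribute narrows the search to one second-level index, while a condition on the second attribute alone does not.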

For which queries is this index good?
- Find RECs Dept = "Sales" AND SAL = k
- Find RECs Dept = "Sales" AND SAL > k
- Find RECs Dept = "Sales"
- Find RECs SAL = k

Interesting application: Geographic Data
DATA: <X1,Y1, Attributes>, <X2,Y2, Attributes>, ...
[Figure: points in the x-y plane]

Queries:
- What city is at <Xi,Yi>?
- What is within 5 miles from <Xi,Yi>?
- Which is closest point to <Xi,Yi>?

Example
[Figure: points a to o in the x-y plane, with a tree index that splits the plane on x and y values and stores the points in its leaves]
- Search points near f
- Search points near b

Queries
- Find points with Yi above a given value
- Find points with Xi < 5
- Find points close to i = <12,38>
- Find points close to b = <7,24>

Many types of geographic index structures have been suggested:
- Quad Trees
- R Trees

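Two more types of multi-key indexes
- Grid
- Bitmap index

Grid Index
[Figure: two-dimensional grid with Key 1 values V1, V2, ..., Vn on one axis and Key 2 values X1, X2, ..., Xn on the other; the cell for (V3, X2) points to the records with key1 = V3, key2 = X2]

CLAIM
Can quickly find records with
- key 1 = Vi AND key 2 = Xj
- key 1 = Vi
- key 2 = Xj
And also ranges, e.g., a range condition on key 1 AND key 2 < Xj.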

But there is a catch with Grid Indexes!
How is a Grid Index stored on disk? Like an array: the X1 to X4 entries for V1, then those for V2, then those for V3.
Problem: need regularity so we can compute the position of the <Vi,Xj> entry.

Solution: Use Indirection
[Figure: grid with rows V1 to V4 and columns X1 to X3 whose cells point to buckets]
- The grid only contains pointers to buckets
- With indirection, the grid can be regular without wasting space
- We do have the price of indirection
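A grid with indirection can be sketched as a regular two-dimensional array of bucket references. The key values, the use of list positions as bucket addresses, and the function names are assumptions for illustration only.

```python
# Hypothetical grid: key-1 values index the rows, key-2 values the columns.
key1_values = ["V1", "V2", "V3"]
key2_values = ["X1", "X2", "X3", "X4"]

# Regular array of bucket references. Empty cells hold None instead of
# reserving a whole bucket, so regularity costs one reference per cell.
grid = [[None] * len(key2_values) for _ in key1_values]
buckets = []                      # bucket storage, addressed by position


def insert(v, x, record):
    i, j = key1_values.index(v), key2_values.index(x)   # computable cell position
    if grid[i][j] is None:
        buckets.append([])
        grid[i][j] = len(buckets) - 1
    buckets[grid[i][j]].append(record)


def lookup(v, x):
    """Records with key 1 = v AND key 2 = x: one cell, then one extra hop."""
    i, j = key1_values.index(v), key2_values.index(x)
    ref = grid[i][j]
    return [] if ref is None else buckets[ref]


def lookup_key1(v):
    """Records with key 1 = v: scan one row of the grid."""
    i = key1_values.index(v)
    return [r for ref in grid[i] if ref is not None for r in buckets[ref]]
```

The extra hop from cell to bucket is the "price of indirection"; in exchange, the grid itself stays a fixed-size array no matter how unevenly records fall into cells.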

Can also index grid on value ranges
[Figure: grid with salary ranges on a linear scale (1, 2, 3) and departments Toy, Sales, Personnel]

Grid files
+ Good for multiple-key search
- Space, management overhead (nothing is free)
- Need partitioning ranges that evenly split keys

Example Grid File for account
- Two attributes as search key: branch-name and balance
- Divide branch-name into non-uniform intervals
- Divide balance into non-uniform intervals
- Query: branch-name < Central and balance within a bounded range
- What about Central <= branch-name < Townsend and a lower bound on balance?

Example Grid File for account
[Figure: grid array whose cells point to buckets such as Bj and Bk]

Grid Files (Cont.)
- Linear scales must be chosen to uniformly distribute records across cells. Otherwise there will be too many overflow buckets.
- Periodic re-organization to increase grid size will help, but reorganization can be very expensive.
- Space overhead of the grid array can be high.

Bitmap Indices
Another kind of index that could be used for multiple-valued search keys.

Bitmap Indices (Cont.)
[Figure: relation with one bitmap for each unique value of gender and of income-level; each bitmap has one bit per record (size = table size). E.g., the income-level value of record 3 is L1]

Some properties of bitmap indices:
- Number of bitmaps for each attribute?
- Size of each bitmap?
- When is the bitmap matrix sparse, and what attributes are good for bitmap indices?

Bitmap indices are generally very small compared with the relation size.
- E.g., if a record is 100 bytes (800 bits), a single bitmap needs 1 bit per record, i.e., 1/800 of the space used by the relation.
- If the number of distinct attribute values is 8, the bitmaps together are only 1% of the relation size.
- What about insertion? Deletion?
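A bitmap index can be sketched with Python integers used as bit vectors. The five rows and the names below are illustrative assumptions; bit r of a bitmap stands for record r.

```python
# Hypothetical relation: one (gender, income_level) pair per record.
rows = [("m", "L1"), ("f", "L2"), ("f", "L1"), ("m", "L4"), ("f", "L3")]


def build_bitmaps(values):
    """One bitmap per distinct value; bit r is set when record r has that value."""
    bitmaps = {}
    for r, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << r)
    return bitmaps


gender = build_bitmaps(g for g, _ in rows)
income = build_bitmaps(lvl for _, lvl in rows)

all_records = (1 << len(rows)) - 1

males_l1 = gender["m"] & income["L1"]                 # intersection (and)
female_or_l4 = gender["f"] | income["L4"]             # union (or)
not_l1 = ~income["L1"] & all_records                  # complementation (not)


def record_ids(bitmap):
    """Record numbers whose bit is set."""
    return [r for r in range(len(rows)) if bitmap >> r & 1]


# A count needs only the bitmap, not the records themselves.
count_males_l1 = bin(males_l1).count("1")
```

There is one bitmap per distinct attribute value and one bit per record in each bitmap, so low-cardinality attributes such as gender are the natural fit.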

Bitmap Indices Queries
Queries are answered using bitmap operations:
- Intersection (and)
- Union (or)
- Complementation (not)
Sample query: males with income level L1. AND the bitmap for males with the bitmap for income level L1; the 1 bits in the result mark the matching records.
What about the number of males with income level L1? Even faster: count the 1 bits of the result without fetching any record.

Hashing
key -> h(key) -> bucket holding <key>
Buckets (typically 1 disk block)

Two alternatives
(1) key -> h(key) -> bucket that holds the records themselves
(2) key -> h(key) -> bucket of index entries -> record
Alt (2) for a secondary search key

Example hash function
- Key = x1 x2 ... xn, an n-byte character string
- Have b buckets
- h: add x1 + x2 + ... + xn, compute sum modulo b
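The example hash function translates almost directly into code. This sketch assumes the key is encoded as UTF-8 to obtain its bytes; the encoding choice is not part of the slides.

```python
def h(key: str, b: int) -> int:
    """Add the bytes x1 + x2 + ... + xn of the key and reduce the sum modulo b."""
    return sum(key.encode("utf-8")) % b
```

Because addition ignores order, keys that are permutations of each other, such as "ab" and "ba", always land in the same bucket.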

This may not be the best function.
Good hash function: the expected number of keys/bucket is the same for all buckets.

Within a bucket:
- Do we keep keys sorted?
- Yes, if CPU time is critical and inserts/deletes are not too frequent

Next: example to illustrate inserts, overflows, deletes

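EXAMPLE: 2 records/bucket
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0, then h(e) = 1
[Figure: buckets 0 to 3; d goes to bucket 0, a and c to bucket 1, b to bucket 2; e overflows bucket 1]

EXAMPLE: deletion
Delete: e, f, c
[Figure: buckets 0 to 3 with an overflow block; after deleting f, maybe move g up]

Rule of thumb: keep space utilization high enough that space is not wasted, but no higher than 80%.
Utilization = (# keys used) / (total # keys that fit)
- If utilization is too low, space is wasted
- If > 80%, overflows become significant; how significant depends on how good the hash function is and on # keys/bucket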
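The insert example and the utilization formula can be replayed with a small sketch. Two points are assumptions: overflow blocks are counted in "total # keys that fit", and deletion repacks the chain so that keys move up out of overflow blocks.

```python
BUCKET_CAPACITY = 2            # records per bucket, as in the example
NUM_BUCKETS = 4

# Hash values taken from the example; any real hash function would do.
h = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}

# Each bucket is a chain of blocks; blocks after the first are overflow blocks.
buckets = [[[]] for _ in range(NUM_BUCKETS)]


def insert(key):
    chain = buckets[h[key]]
    if len(chain[-1]) == BUCKET_CAPACITY:     # last block full: add an overflow block
        chain.append([])
    chain[-1].append(key)


def delete(key):
    chain = buckets[h[key]]
    keys = [k for block in chain for k in block if k != key]
    # Repack the chain so that overflow blocks empty out first.
    chain[:] = [keys[i:i + BUCKET_CAPACITY]
                for i in range(0, len(keys), BUCKET_CAPACITY)] or [[]]


def utilization():
    used = sum(len(block) for chain in buckets for block in chain)
    fit = sum(len(chain) for chain in buckets) * BUCKET_CAPACITY
    return used / fit


for key in "abcde":
    insert(key)
```

After the five inserts, bucket 1 holds a and c in its primary block and e in an overflow block.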

How do we cope with growth?
- Overflows and reorganizations
- Dynamic hashing
  - Extensible
  - Linear

Extensible hashing: two ideas
(a) Use i of the b bits output by the hash function; i grows over time.
(b) Use a directory: h(k)[i] selects the directory entry, which points to the bucket.

Example: h(k) is 4 bits; 2 keys/bucket
[Figure: directory with i = 1; an insert into a full bucket splits it and produces a new directory with i = 2]

Example continued
[Figure: i = 2; insert 0111 and 0000; the bucket holding 0001 splits into a bucket with 0000, 0001 and a bucket with 0111, without doubling the directory]

Example continued
[Figure: a further insert into a full bucket doubles the directory from i = 2 to i = 3]

Extensible hashing: deletion
- No merging of blocks, or
- Merge blocks and cut directory if possible (reverse insert procedure)

Deletion example: run through the insert example in reverse!

Summary: Extensible hashing
+ Can handle growing files
  - with less wasted space
  - with no full reorganizations
- Indirection (not bad if directory in memory)
- Directory doubles in size (now it fits, now it does not)
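The two ideas of extensible hashing (use i of the b hash bits, and go through a directory) can be sketched as below. Assumptions made for illustration: keys are distinct and serve as their own b-bit hash values, the leading i bits index the directory, and deletion is left out. The class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth          # number of leading bits this bucket distinguishes
        self.keys = []


class ExtensibleHash:
    """Sketch: distinct keys act as their own b-bit hash values."""

    def __init__(self, b=4, capacity=2):
        self.b, self.capacity, self.i = b, capacity, 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, key):
        return key >> (self.b - self.i)             # first i of the b bits

    def lookup(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        if bucket.depth == self.i:                  # directory must double
            self.directory = [bkt for bkt in self.directory for _ in (0, 1)]
            self.i += 1
        # Split the full bucket on one more bit and re-point directory slots.
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        shift = self.i - bucket.depth
        for slot, bkt in enumerate(self.directory):
            if bkt is bucket and (slot >> shift) & 1:
                self.directory[slot] = sibling
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:
            self.insert(k)                          # may split again if keys still collide
```

Only the overflowing bucket is split; the directory doubles only when that bucket already uses all i bits, which is why no full reorganization of the file is needed.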

Linear hashing
Another dynamic hashing scheme. Two ideas:
(a) Use the i low-order bits of the hash; i grows over time.
(b) File grows linearly.

Example: b=4 bits, i=2, 2 keys/bucket
[Figure: buckets 00 and 01 in use, m = 01 (max used block); buckets beyond m are reserved for future growth; an insert into a full bucket can create overflow chains!]

Rule: If h(k)[i] ≤ m, then look at bucket h(k)[i]; else, look at bucket h(k)[i] - 2^(i-1).

Example: b=4 bits, i=2, 2 keys/bucket (continued)
[Figure: the same file as m advances and future-growth buckets come into use]

Example continued: how to grow beyond this?
[Figure: with m = 11 (max used block) every 2-bit bucket is in use, so i grows from 2 to 3 and 3-bit bucket numbers become available]

When do we expand the file?
- Keep track of: U = (# used slots) / (total # of slots)
- If U > threshold, then increase m (and maybe i)

Summary: Linear hashing
+ Can handle growing files
  - with less wasted space
  - with no full reorganizations
+ No indirection like extensible hashing
- Can still have overflow chains
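The linear hashing lookup rule and the growth step can be sketched as follows. Assumptions made for illustration: keys serve as their own hash values, a Python list that grows past the capacity stands in for an overflow chain, U is computed against primary slots only, and the threshold of 0.8 is arbitrary. The class and method names are hypothetical.

```python
class LinearHash:
    """Sketch: keys act as their own hash values; the i low-order bits pick the bucket."""

    def __init__(self, capacity=2, threshold=0.8):
        self.capacity, self.threshold = capacity, threshold
        self.i, self.m = 1, 1                  # buckets 0..m are in use
        self.buckets = [[], []]                # a list longer than capacity models an overflow chain

    def _address(self, key):
        a = key & ((1 << self.i) - 1)          # h(k)[i]: the i low-order bits
        return a if a <= self.m else a - (1 << (self.i - 1))

    def lookup(self, key):
        return key in self.buckets[self._address(key)]

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        used = sum(len(bkt) for bkt in self.buckets)
        if used / (len(self.buckets) * self.capacity) > self.threshold:
            self._grow()

    def _grow(self):
        self.m += 1
        if self.m == 1 << self.i:              # all i-bit addresses in use: take one more bit
            self.i += 1
        self.buckets.append([])
        # Only the bucket that differs from the new one in the top bit is split.
        old = self.m - (1 << (self.i - 1))
        keys, self.buckets[old] = self.buckets[old], []
        for k in keys:
            self.buckets[self._address(k)].append(k)
```

Note that the bucket that gets split is determined by m, not by where the overflow happened, so a bucket can keep its overflow chain for a while.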

Example: BAD CASE
[Figure: very full buckets far beyond m and very empty buckets near the start; m would need to move to the full buckets, which would waste space]

Summary
Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear

Indexing vs Hashing
Hashing good for probes given key, e.g.,
    SELECT ... FROM R WHERE R.A = 5

INDEXING good for range searches, e.g.,
    SELECT ... FROM R WHERE R.A > 5
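The contrast between the two query types can be sketched with a dictionary standing in for a hash index and a sorted list standing in for an ordered index. The relation contents and the names are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical relation R with a single attribute A.
R = [7, 3, 5, 9, 1, 5]

hash_index = {}                       # A value -> record positions
for pos, a in enumerate(R):
    hash_index.setdefault(a, []).append(pos)

ordered_index = sorted((a, pos) for pos, a in enumerate(R))   # sorted on A


def probe(a):
    """R.A = a: a single hash lookup."""
    return hash_index.get(a, [])


def greater_than(a):
    """R.A > a: find the boundary once, then read the index sequentially."""
    start = bisect_right(ordered_index, (a, len(R)))
    return [pos for _, pos in ordered_index[start:]]
```

The hash index has no notion of neighboring keys, so answering `R.A > 5` with it would mean probing every possible value or scanning everything; the ordered index answers it with one search plus a sequential read.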