Hash-Based Indexing 1

Similar documents
Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Hash-Based Indexes. Chapter 11

Hashed-Based Indexing

Hash-Based Indexes. Chapter 11 Ramakrishnan & Gehrke (Sections ) CPSC 404, Laks V.S. Lakshmanan 1

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Spring

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Fall

Lecture 8 Index (B+-Tree and Hash)

CSIT5300: Advanced Database Systems

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

TUTORIAL ON INDEXING PART 2: HASH-BASED INDEXING

Database Applications (15-415)

Chapter 6. Hash-Based Indexing. Efficient Support for Equality Search. Architecture and Implementation of Database Systems Summer 2014

Database Applications (15-415)

4 Hash-Based Indexing

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Material You Need to Know

Introduction to Data Management. Lecture 15 (More About Indexing)

Module 5: Hash-Based Indexing

Kathleen Durant PhD Northeastern University CS Indexes

Database Applications (15-415)

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

Hashing file organization

Understand how to deal with collisions

key h(key) Hash Indexing Friday, April 09, 2004 Disadvantages of Sequential File Organization Must use an index and/or binary search to locate data

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Data and File Structures Chapter 11. Hashing

Tree-Structured Indexes

Chapter 11: Indexing and Hashing

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Indexing: Overview & Hashing. CS 377: Database Systems

Chapter 12: Indexing and Hashing. Basic Concepts

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Chapter 12: Indexing and Hashing

Tree-Structured Indexes

CS698F Advanced Data Management. Instructor: Medha Atre. Aug 11, 2017 CS698F Adv Data Mgmt 1

Introduction to Data Management. Lecture 21 (Indexing, cont.)

Tree-Structured Indexes

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Hash Tables. Hashing Probing Separate Chaining Hash Function

Hashing Techniques. Material based on slides by George Bebis

Hashing. Data organization in main memory or disk

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Indexing and Hashing

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

Chapter 1 Disk Storage, Basic File Structures, and Hashing.

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

Chapter 11: Indexing and Hashing

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Chapter 11: Indexing and Hashing

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Chapter 12: Indexing and Hashing

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Some Practice Problems on Hardware, File Organization and Indexing

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

Tree-Structured Indexes. Chapter 10

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

Tree-Structured Indexes

Tree-Structured Indexes

HW 2 Bench Table. Hash Indexes: Chap. 11. Table Bench is in tablespace setq. Loading table bench. Then a bulk load

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

Dictionaries and Hash Tables

Storage hierarchy. Textbook: chapters 11, 12, and 13

Overview of Storage and Indexing

Intro to DB CHAPTER 12 INDEXING & HASHING

Introduction to Hashing

Introduction. Choice orthogonal to indexing technique used to locate entries K.

Tree-Structured Indexes

Tables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Hashing. Introduction to Data Structures Kyuseok Shim SoEECS, SNU.

File Organization and Storage Structures

Chapter 12: Indexing and Hashing (Cnt(

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

else // m + 1 = d + }{{} 1 + d

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

Fundamentals of Database Systems Prof. Arnab Bhattacharya Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

2, 3, 5, 7, 11, 17, 19, 23, 29, 31

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

CS 350 Algorithms and Complexity

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015

CSCD 326 Data Structures I Hashing

DATA STRUCTURES/UNIT 3

Tree-Structured Indexes (Brass Tacks)

Spring 2017 B-TREES (LOOSELY BASED ON THE COW BOOK: CH. 10) 1/29/17 CS 564: Database Management Systems, Jignesh M. Patel 1

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

THINGS WE DID LAST TIME IN SECTION

Transcription:

Hash-Based Indexing 1

Tree Indexing Summary Static and dynamic data structures ISAM and B+ trees Speed up both range and equality searches B+ trees very widely used in practice ISAM trees can be useful if dataset relatively static (not a lot of overflow pages) Because index is static, no need to lock index pages

Indexing using Hashing Hash-based indexes are for equality selections. Cannot support range searches. Static and dynamic hashing techniques exist; tradeoffs similar to ISAM vs. B+ trees. 3

Indexing Using Hashing As for any index, 3 alternatives for data entries k*: Data record with key value k <k, rid of data record with search key value k> <k, list of rids of data records with search key k> 4

Static Hashing For each key k, compute some hash function h(k) h(k) mod N = bucket to which data entry with key k belongs. (N= # of buckets) h(key) mod N key h 0 1 N-1 Primary bucket pages Overflow pages 5

Static Hashing Hash fn works on search key field of record r. h(k) mod N must distribute values over range 0... N-1. h(k) = (a * k + b) usually works well. a and b are constants that can be used to tune h 6

Static Hashing Primary bucket pages fixed, allocated sequentially, never de-allocated; overflow pages if needed If no overflow pages, lookup just one disk I/O Overflow pages degrade performance 7

Static Hashing If too many overflow pages, can rehash (reorganize into more buckets) E.g. double the number of buckets Problems: Takes time if index is big Index cannot be used during rehashing 8

Extendible Hashing Better idea: split only the bucket that has overflowed Keep a directory of pointers to buckets, so searches for the old bucket are properly directed to the two new ones When bucket is split, only directory needs to be adjusted, and this is pretty small 9

Example LOCAL DEPTH GLOBAL DEPTH 4* 1* 3* 16* Bucket A 00 1* 5* 1* 13* Bucket B 01 10 11 10* Bucket C DIRECTORY 15* 7* 19* Bucket D Notation: 1* = data entry with hash value 1 DATA PAGES To find bucket for record with key k, look at last bits of h(k) E.g. if h(k) = 14 i.e. 1110, goes into 3 rd bucket 10

Example LOCAL DEPTH GLOBAL DEPTH 4* 1* 3* 16* Bucket A Now suppose want to insert record 01 with h(k) = 0 0 in binary is 10100 00 10 11 1* 5* 1* 10* 13* Bucket B Bucket C Belongs in first bucket which is full! Let's split that bucket (and only that one) into How to distribute entries among new buckets? Look at 3 rd bit from the right! DIRECTORY 15* 7* 19* DATA PAGES Bucket D 11

Insert h(r)=0 (Causes Doubling) LOCAL DEPTH GLOBAL DEPTH 3*16* Bucket A LOCAL DEPTH GLOBAL DEPTH 3 3* 16* Bucket A 00 01 10 11 1* 5* 1*13* 10* Bucket B Bucket C 000 001 010 011 3 1* 5* 1* 13* 10* Bucket B Bucket C DIRECTORY 15* 7* 19* Bucket D 100 101 110 15* 7* 19* Bucket D 4* 1* 0* Bucket A (`split image' of Bucket A) 111 DIRECTORY 3 4* 1* 0* Bucket A (`split image' of Bucket A) 1

Global and local depth Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to. Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket. Bucket C has local depth, so all entries will share the same last bits but may have a different 3 rd -to-last one! 13

Points to Note Splitting does not always require directory doubling! E.g. if we now insert lots of values into Bucket C and need to split it, no need to double directory When does bucket split cause directory doubling? When its local depth pre-split was equal to the global depth! 14

What about deletions? If removal of data entry makes bucket empty, can be merged with `split image. If each directory element points to same bucket as its split image, can halve directory. In practice may omit this step and leave index as is Relatively common for other indexes too e.g. B+ trees Justification: deletes typically less frequent than inserts 15

Performance of Extendible Hashing If directory fits in memory, equality search answered with one disk access; else two. Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large. 16

Performance of Extendible Hashing Doubling the directory is a "stop the world" operation and may still take a while May still need overflow pages if we have hash collisions (two data entries/search keys have same hash value) 17

Linear Hashing Another dynamic hashing scheme, an alternative to Extendible Hashing. LH is similar to EH but does not use a directory Allows some (more) overflow pages instead 18

Linear Hashing Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin. Do not necessarily split the bucket that has overflowed, but rather the bucket whose "turn" it is. 19

Linear Hashing Splitting proceeds in rounds. Round ends when all N R initial (for round R) buckets are split. Current round number is Level. 0

Overview of LH File In the middle of a round. Bucket to be split Next Buckets split in this round: Buckets that existed at the beginning of this round: `split image' buckets: created (through splitting of other buckets) in this round 1

The big question How do we quickly find the right bucket for search key k? After all, that's the whole reason we are doing hashing Linear hashing: have TWO hash functions in play at all times

Hash function families Use a family of hash functions: Range of is twice the range of Can generate them from some "starter" function h that maps search keys to integers Say initial number of buckets is Can take And as last d bits of h(k) is last d+i bits of h(k) E.g. suppose N = 4 3

Why we don't need a directory Bucket to be split Next Buckets that existed at the beginning of this round: this is the range of h Level Buckets split in this round: If h Level ( search key value ) is in this range, must use h Level+1 ( search key value ) to decide if entry is in `split image' bucket. `split image' buckets: created (through splitting of other buckets) in this round 4

Why we don't need a directory For search, need to know which Level (round) we are in. To find bucket for data entry r, find h Level (r): If Next <= h Level (r) <= N R, r belongs here. Else, r could belong to bucket h Level (r) or bucket h Level (r) + N R ; must apply h Level+1 (r) to find out. 5

Example inserting 43* Level=0, N=4 Level=0 h 1 h 0 PRIMARY Next=0 PAGES h 1 h 0 PRIMARY PAGES OVERFLOW PAGES 000 001 00 01 3*44* 36* 9* 5* 5* 000 001 00 01 3* Next=1 9* 5* 5* 010 10 14* 18*10* 010 10 14* 18*10*30* 011 11 31* 35* 7* 11* 011 11 31*35* 7* 11* 43* 100 00 44* 36* Note this is NOT a directory 6

Example: End of a Round, insert 50* h 1 h0 Level=0 PRIMARY PAGES OVERFLOW PAGES h 1 000 001 h 0 00 01 Level=1 Next=0 PRIMARY PAGES 3* 9* 5* OVERFLOW PAGES 000 001 010 011 00 01 10 11 3* 9* 5* 66* 18* 10* 34* Next=3 31* 35* 7* 11* 43* 010 011 100 10 11 00 66* 18* 10* 34* 43* 35* 11* 44* 36* 50* 100 00 44*36* 101 11 5* 37* 9* 101 01 5* 37*9* 110 10 14* 30* * 110 10 14* 30* * 111 11 31*7* 7

Linear Hashing Continued Insert: Find bucket by applying h Level or h Level+1 : If bucket to insert into is full: Add overflow page and insert data entry. (Maybe) Split Next bucket and increment Next. Can choose any criterion to trigger split, eg. desired occupancy Similarities/differences to doubling directory in EH 8

Linear Hashing Continued Delete: can be implemented as reverse of insert, or just ignored (leave index as is) Cost of lookup: 1 I/O if primary bucket pages consecutive on disk 9

Summary Hash-based indexes: best for equality searches, cannot support range searches. Static Hashing can lead to long overflow chains. 30

Summary Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. Directory to keep track of buckets, doubles periodically. Can get large with skewed data; additional I/O if this does not fit in main memory. 31

Summary (Contd.) Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages. Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense data areas. Can tune criterion for triggering splits to trade-off slightly longer chains for better space utilization. 3