Hashed-Based Indexing

Similar documents
Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Hash-Based Indexing 1

Hash-Based Indexes. Chapter 11

Hash-Based Indexes. Chapter 11 Ramakrishnan & Gehrke (Sections ) CPSC 404, Laks V.S. Lakshmanan 1

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Spring

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Fall

Lecture 8 Index (B+-Tree and Hash)

CSIT5300: Advanced Database Systems

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Database Applications (15-415)

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Database Applications (15-415)

TUTORIAL ON INDEXING PART 2: HASH-BASED INDEXING

Database Applications (15-415)

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Kathleen Durant PhD Northeastern University CS Indexes

Chapter 6. Hash-Based Indexing. Efficient Support for Equality Search. Architecture and Implementation of Database Systems Summer 2014

Material You Need to Know

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Introduction to Data Management. Lecture 15 (More About Indexing)

Chapter 1 Disk Storage, Basic File Structures, and Hashing.

4 Hash-Based Indexing

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

key h(key) Hash Indexing Friday, April 09, 2004 Disadvantages of Sequential File Organization Must use an index and/or binary search to locate data

Data Storage and Query Answering. Indexing and Hashing (5)

Hashing file organization

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 11: Indexing and Hashing

Chapter 12: Indexing and Hashing

Indexing: Overview & Hashing. CS 377: Database Systems

Introduction to Data Management. Lecture 21 (Indexing, cont.)

Intro to DB CHAPTER 12 INDEXING & HASHING

Data and File Structures Chapter 11. Hashing

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Chapter 13: Indexing. Chapter 13. ? value. Topics. Indexing & Hashing. value. Conventional indexes B-trees Hashing schemes (self-study) record

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Module 5: Hash-Based Indexing

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Hashing. Data organization in main memory or disk

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Tree-Structured Indexes

else // m + 1 = d + }{{} 1 + d

File Organization and Storage Structures

Chapter 12: Indexing and Hashing (Cnt(

The physical database. Contents - physical database design DATABASE DESIGN I - 1DL300. Introduction to Physical Database Design

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Chapter 11: Indexing and Hashing

Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Chapter 12: Indexing and Hashing

CS 525: Advanced Database Organization 04: Indexing

Indexing and Hashing

Hashing Techniques. Material based on slides by George Bebis

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Tree-Structured Indexes. Chapter 10

Data Management for Data Science

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Physical Level of Databases: B+-Trees

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Storage hierarchy. Textbook: chapters 11, 12, and 13

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

CS 245: Database System Principles

Tree-Structured Indexes

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers

CS698F Advanced Data Management. Instructor: Medha Atre. Aug 11, 2017 CS698F Adv Data Mgmt 1

CS 350 Algorithms and Complexity

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Tree-Structured Indexes

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

Hashing. Introduction to Data Structures Kyuseok Shim SoEECS, SNU.

BBM371& Data*Management. Lecture 6: Hash Tables

Spring 2017 B-TREES (LOOSELY BASED ON THE COW BOOK: CH. 10) 1/29/17 CS 564: Database Management Systems, Jignesh M. Patel 1

Chapter 11: Indexing and Hashing

Tree-Structured Indexes

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

Hash Table and Hashing

Tree-Structured Indexes

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

Indexing. vanilladb.org

Some Practice Problems on Hardware, File Organization and Indexing

Fundamentals of Database Systems Prof. Arnab Bhattacharya Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

CSC 261/461 Database Systems Lecture 17. Fall 2017

CSE 562 Database Systems

(2,4) Trees Goodrich, Tamassia. (2,4) Trees 1

Remember. 376a. Database Design. Also. B + tree reminders. Algorithms for B + trees. Remember

Hashing. Hashing Procedures

Comp 335 File Structures. Hashing

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Introduction. Choice orthogonal to indexing technique used to locate entries K.

HW 2 Bench Table. Hash Indexes: Chap. 11. Table Bench is in tablespace setq. Loading table bench. Then a bulk load

Tree-Structured Indexes

Extendible Chained Bucket Hashing for Main Memory Databases. Abstract

CSE 373 Spring 2010: Midterm #2 (closed book, closed notes, NO calculators allowed)

Transcription:

Topics Hashed-Based Indexing Linda Wu Static hashing Dynamic hashing Extendible Hashing Linear Hashing (CMPT 54 4-) Chapter CMPT 54 4- Static Hashing An index consists of buckets 0 ~ N-1 A bucket consists of one primary page and, possibly, additional overflow pages Buckets contain data entries Hash function h h must distribute values in the domain of search field uniformly over the buckets h(k) = (a * k + b) usually works well, a and b are constants for tuning h h(k) mod N: bucket to which the data entry with search key k belongs Static Hashing (Cont.) key h(key) mod N h 0 N-1 Primary bucket pages Overflow pages # of primary pages, N, is fixed Primary pages are allocated sequentially, never de-allocated; overflow pages allocated if needed Long overflow chains can develop and degrade performance Chapter CMPT 54 4- Chapter CMPT 54 4-4

Static Hashing (Cont.) Search Identify the correct bucket using h Search the bucket for the data entry Insert Identify the correct bucket using h If no space in the bucket, allocate a new overflow page to the overflow chain of the bucket, put the data entry on this page Delete Identify the correct bucket using h Search the bucket for the data entry, remove it If it is the last in an overflow page, remove the page from the chain and de-allocate the page Chapter CMPT 54 4-5 Extendible Hashing To avoid overflow page Possible solution: if a bucket is full, reorganize file by doubling # of buckets and redistributing the entries across the new set of buckets * Reading and writing all pages is expensive! Solution Use a directory of pointers to buckets, double # of buckets by doubling the directory, splitting just the bucket that overflowed! * Directory is much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page! * Trick lies in how hash function is adjusted! Chapter CMPT 54 4-6 Example Directory is an array of size 4 To find the correct bucket for k Take last global depth () bits of h(k) e.g., if h(k) = 5 = 1 (binary), it is in the bucket pointed to by dictionary element To insert If bucket is full, split it (allocate a new page, re-distribute entries) If necessary, double the directory LOCAL DEPTH GLOBAL DEPTH DIRECTORY (contain page ids) 4* 1* * 15* 1* * 1 5* 1* 1* 7* 19* DATA ENTRIES PAGES Bucket A Bucket B Bucket C Bucket D Chapter CMPT 54 4-7 Chapter CMPT 54 4-8

Insert entry r with h(r) = 0 LOCAL DEPTH GLOBAL DEPTH Step 1: split bucket *1 Chapter CMPT 541 4-1 9 1* * 15* 4* 1 5* 1*1* 7* 19* 1*0* Bucket A Bucket B Bucket C Bucket D Bucket A (split image of A) LOCAL DEPTH GLOBAL DEPTH 0 1 0 1 1 1 1 1 1* * 15* 4* *1 5* 1*1* 7* 19* 1* 0* Bucket A Bucket B Bucket C Bucket D Step : double directory and increase depth Bucket A (split image of A) Chapter CMPT 54 4- Points to note Allocate a new bucket page; write both this page and the old bucket page; double the dictionary array h(r) = 0 = binary 1: last bits () tell r belongs in A or A; last bits (0 / ) are needed to tell which bucket Global depth of directory Max # of bits needed to tell which bucket an entry belongs to Local depth of a bucket # of bits used to determine if an entry belongs to this bucket Chapter CMPT 54 4- When does bucket split cause directory doubling? Before inserting, local depth of bucket = global depth Inserting causes local depth to become > global depth Directory is doubled by copying it over and adjusting pointer to split image page (use of least significant bits enables efficient doubling via copying of directory!) Chapter CMPT 54 4-1

0 1 Why use least significant bits in directory? Allows for efficient doubling via copying 1 6 = 1 Least Significant 0 1 0 1 1 1 1 1 1 0 1 6 = 1 Most Significant Chapter CMPT 54 4-1 0 1 0 1 1 1 1 1 Delete Locate the data entry by computing its hash value, taking the last bits, and looking in the bucket pointed to by this directory element Remove the data entry If the removal makes the bucket empty, it can be merged with its split image; local depth is decreased If each directory element points to same bucket as its split image, the directory can be halved; global depth is decreased * The last steps can be omitted Chapter CMPT 54 4-14 I/O cost of equality search If the directory fits in memory, equality search can be answered with one disk access otherwise, two Collision (duplicate) handling Collision: multiple data entries with the same hash value Use overflow page when more data entries than will fit on a page have the same hash value Linear Hashing An alternative to Extendible Hashing Not require directory; handle duplicates Idea: use a family of hash functions h 0, h 1, h, h i (key) = h(key) mod ( i N) N = initial # buckets; h is some hash function (range is not 0 to N-1) If N = d0, h i consists of applying h and looking at the last d i bits, where d i = d 0 + i h i+1 doubles the range of h i (similar to directory doubling) Chapter CMPT 54 4-15 Chapter CMPT 54 4-16

Linear Hashing avoids directory by using overflow pages, and choosing bucket to split round-robin Splitting proceeds in rounds ; current round number is Level (initial value is 0) Bucket to split is denoted by Next Next is initialized to 0 when a new round begins, and increased by 1 after a splitting Buckets (0 ~ Next-1) have been split; buckets (Next ~ N Level ) yet to be split Splitting is triggered when an overflow page is added, and h Level+1 redistributes entries between this bucket and its split image The round Level ends when all N Level initial buckets are split Chapter CMPT 54 4-17 In the middle of a round Bucket to be split Next Buckets that existed at the beginning of this round: this is the range of h Level Buckets split in this round: If h Level ( search key value) is in this range, must use h Level+1(search key value ) to decide if entry is in split image bucket Split image buckets: created (through splitting of other buckets) in this round Chapter CMPT 54 4-18 Search for data entry r To find bucket for data entry r, calculate h Level (r) If h Level (r) is in the range (Next ~ N R ), r belongs here Else, r could belong to bucket h Level (r) or bucket h Level (r) + N Level ; must apply h Level+1 (r) to find out Insert Find bucket by applying h Level / h Level+1 If bucket to insert into is full Add overflow page and insert data entry Split Next bucket (its entries are redistributed by h Level+1 ), and increment Next Actually any criterion can be chosen to trigger splitting Since buckets are split round-robin, long overflow chains don t develop! Doubling of directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased Chapter CMPT 54 4-19 Chapter CMPT 54 4-0

h1 0 1 0 1 h0 On split, h Level+1 is used to redistribute entries Level=0, N=4 Next=0 PRIMARY PAGES * 44* 9* 5* 5* 14* 18**0* 1* 5* 7* * (The contents of the linear hashed file) Insert r h(r) = 4 Chapter CMPT 54 4-1 h1 0 1 0 1 1 h0 Next=1 Level=0 PRIMARY PAGES * 9* 5* 5* 14* 18**0* 1*5* 7* * 44* OVERFLOW PAGES 4* h1 0 1 0 1 1 1 1 h0 End of a round Level=0 * 9* 5* 6 18* * 4* Next= 1* 5* 7** 4* 44* 5* 7*9* 14* 0* * Insert 50* Level=1 h1 h0 Next=0 0 * 5* 7* 9* 1 1* 7* Chapter CMPT 54 4-1 0 1 1 1 1 9*5* 618**4* 4* 5** 44* 14* 0* * 50* Extendible vs. Linear Hashing The two schemes are actually quite similar Begin with an imaginary directory in LH with N elements First split is at bucket 0 (the imaginary directory is doubled at this point). Since elements <1,N+1>, <,N+>,... are the same, we need only create directory element N, which differs from 0, now. When bucket 1 splits, create directory element N+1, etc. The directory is doubled gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too, we don t need a directory! Extendible vs. Moving from h i to h i+1 in LH corresponds to doubling the directory in EH Extendible Hashing Directory is doubled in a single step always splitting the appropriate bucket (dense data area): reduced # of splits and a higher bucket occupancy Linear Hashing Directory is doubled gradually over the course of a round A directory can be avoided by a clever choice of the buckets to split Chapter CMPT 54 4- Chapter CMPT 54 4-4