
Hashing

Symbol Table A symbol table is used widely in many applications: a dictionary is a kind of symbol table, and a data dictionary is the symbol table used in database management. In general, the following operations are performed on a symbol table: determine whether a particular name is in the table, retrieve the attributes of that name, modify the attributes of that name, insert a new name and its attributes, and delete a name and its attributes.

Symbol Table (Cont.) Popular operations on a symbol table include search, insertion, and deletion. A binary search tree could be used to represent a symbol table, but the worst-case complexity of these operations is then O(n). Hashing supports insertions, deletions, and finds in essentially constant time. The general data structure used for hashing is an array of fixed size (TableSize) containing the keys, together with a hash function.
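As a concrete illustration, the fixed-size-array-plus-hash-function structure might be declared as follows. This is only a sketch; the names element, TABLE_SIZE, and SLOTS_PER_BUCKET are invented for the example, and the first-character hash assumes identifiers start with an uppercase letter.

```c
#define TABLE_SIZE 26         /* b buckets, chosen only for illustration           */
#define SLOTS_PER_BUCKET 2    /* s records per bucket                              */

typedef struct {
    char key[16];             /* the identifier; an empty string marks a free slot */
    int  attributes;          /* whatever attributes the application stores        */
} element;

/* the hash table: a fixed-size array of b buckets, each holding s records */
static element ht[TABLE_SIZE][SLOTS_PER_BUCKET];

/* the hash function maps an identifier to a bucket index 0 .. TABLE_SIZE-1 */
static unsigned hash(const char *x) {
    return (unsigned)(x[0] - 'A') % TABLE_SIZE;   /* "first character" hashing */
}
```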

Hashing Static hashing: perfect hashing (no collisions) and hashing with collisions, where collisions are handled by separate chaining, open addressing (linear probing, quadratic probing, or double hashing), or rehashing. Dynamic hashing: extensible hashing and linear hashing.

Static Hashing Identifiers are stored in a fixed-size table called a hash table. The address or location of an identifier x is obtained by computing some arithmetic function h(x). The memory available to maintain the symbol table (hash table) is assumed to be sequential. The hash table consists of b buckets, and each bucket contains s records. h(x) maps the set of possible identifiers onto the integers 0 through b-1.

Hash Tables The identifier density of a hash table is the ratio n/T, where n is the number of identifiers in the table and T is the total number of possible identifiers. The loading density or loading factor of a hash table is α = n/(sb).

Hash Tables (Cont.) Two identifiers I1 and I2 are said to be synonyms with respect to h if h(I1) = h(I2). An overflow occurs when a new identifier i is mapped or hashed by h into a full bucket. A collision occurs when two non-identical identifiers are hashed into the same bucket. If the bucket size is 1, collisions and overflows occur at the same time.

Example 8.1 Assume a hash table with 26 buckets (numbered 0 through 25), two slots per bucket, and a hash function that uses the first character of the identifier. Ten distinct identifiers GA, D, A, G, L, A2, A1, A3, A4, and E are to be entered. All of the A-identifiers hash to bucket 0 (the figure shows A and A2 occupying its two slots, D in bucket 3, and GA and G sharing bucket 6), so there is a large number of collisions and overflows. If no overflow occurs, the time required for hashing depends only on the time required to compute the hash function h.

Hash Function A good hash function should be simple to compute, should minimize the number of collisions, and should depend on all the characters of the identifier. A uniform hash function is one for which, with b buckets, h(x) = i occurs with probability 1/b for each bucket i.

Mid-Square The mid-square function h_m is computed by squaring the identifier and then using an appropriate number of bits from the middle of the square to obtain the bucket address.
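A minimal C sketch of a mid-square hash, assuming the identifier has already been converted to a 32-bit integer and the table size is 2^r; both assumptions are mine, not from the slide.

```c
#include <stdint.h>

/* Mid-square: square the key and take r bits (r < 32) from the middle of
 * the 64-bit product as the bucket address (table size assumed to be 2^r). */
static unsigned mid_square(uint32_t key, unsigned r) {
    uint64_t square = (uint64_t)key * key;     /* the full 64-bit square     */
    unsigned shift  = (64 - r) / 2;            /* start of the middle r bits */
    return (unsigned)((square >> shift) & ((1u << r) - 1u));
}
```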

Division The division method uses the modulo (%) operator: an identifier x is divided by some number M and the remainder is used as the hash address for x. The bucket addresses are in the range 0 through M-1. If M is a power of 2, then h_D(x) depends only on the least significant bits of x.

Division (Cont.) If identifiers are stored right-justified with leading zeros and M = 2^i, i <= 6, then identifiers such as A1, B1, C1, etc., which share the same last character, all hash to the same bucket. If a division function h_D is used as the hash function, the table size should therefore not be a power of two. Likewise, if M is divisible by two, odd keys are mapped to odd buckets and even keys to even buckets, so the hash table is biased.
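A sketch of the division method in C; the table size M is illustrative and is deliberately chosen as a prime rather than a power of two, for the reasons given above.

```c
#define M 97   /* illustrative table size: a prime, not a power of two */

/* Division method: the remainder of the key divided by M is the bucket
 * address, so every bit of the key can influence the result.           */
static unsigned h_div(unsigned long key) {
    return (unsigned)(key % M);
}
```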

Folding The identifier x is partitioned into several parts, all but the last being of the same length, and the partitions are added together to obtain the hash address for x. Shift folding: the partitions are simply added together to get h(x). Folding at the boundaries: the identifier is folded at the partition boundaries, and digits falling into the same position are added together to obtain h(x); this is similar to reversing every other partition and then adding.

Example 8.2 x = 12320324111220 is partitioned into parts three decimal digits long: P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20. Shift folding: h(x) = P1 + P2 + P3 + P4 + P5 = 123 + 203 + 241 + 112 + 20 = 699. Folding at the boundaries (every other partition, here P2 and P4, is reversed before adding): h(x) = 123 + 302 + 241 + 211 + 20 = 897.
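The two folding variants of Example 8.2, sketched in C on an array of already-partitioned decimal parts. The partitioning step itself is omitted, and the helper reverse3 (which assumes three-digit parts) is mine, not from the slide.

```c
/* Shift folding: simply add up the partitions. */
static unsigned shift_fold(const unsigned parts[], int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += parts[i];
    return sum;
}

/* Reverse the decimal digits of a three-digit part, e.g. 203 -> 302. */
static unsigned reverse3(unsigned p) {
    return (p % 10) * 100 + ((p / 10) % 10) * 10 + p / 100;
}

/* Folding at the boundaries: reverse every other partition, then add. */
static unsigned boundary_fold(const unsigned parts[], int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += (i % 2 == 1) ? reverse3(parts[i]) : parts[i];
    return sum;
}
```

With the parts of Example 8.2, {123, 203, 241, 112, 20}, shift_fold returns 699 and boundary_fold returns 897, matching the values above.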

Overflow Handling There are two ways to handle overflows: open addressing and chaining.

Open Addressing Open addressing assumes the hash table is an array. The hash table is initialized so that each slot contains the null identifier. When a new identifier hashes into a full bucket, the closest unfilled bucket is found and used; this is called linear probing or linear open addressing.

Example 8.3 Assume a 26-bucket table with one slot per bucket and the following identifiers: GA, D, A, G, L, A2, A1, A3, A4, Z, ZA, E. Let the hash function be h(x) = first character of x, and resolve overflows by linear probing. When G is entered it collides with GA and is placed at ht[7] instead of ht[6]. After all insertions the table holds A, A2, A1, D, A3, A4, GA, G, ZA, E in ht[0] through ht[9], L in ht[11], and Z in ht[25].

Open Addressing (Cont.) When linear open addressing is used to handle overflows, a hash table search for identifier x proceeds as follows: compute h(x), then examine the identifiers at positions ht[h(x)], ht[h(x)+1], ..., ht[h(x)+j] (wrapping around to the start of the table when necessary), in this order, until one of the following conditions occurs: ht[h(x)+j] = x, in which case x is found; ht[h(x)+j] is null, in which case x is not in the table; or we return to the starting position h(x), in which case the table is full and x is not in the table.
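A C sketch of this search loop (my own rendering of the procedure above, not code from the slides); the table is an array of string pointers and NULL marks a null slot.

```c
#include <string.h>

/* Search for identifier x using linear probing.  Returns the index of x,
 * or -1 if x is not in the table (a null slot is reached, or the probe
 * wraps all the way around a full table).                                */
static int linear_search(char *ht[], int table_size,
                         const char *x, unsigned (*h)(const char *)) {
    unsigned start = h(x) % (unsigned)table_size;
    for (int j = 0; j < table_size; j++) {
        int pos = (int)((start + (unsigned)j) % (unsigned)table_size);
        if (ht[pos] == NULL)              /* ht[h(x)+j] is null: x not present */
            return -1;
        if (strcmp(ht[pos], x) == 0)      /* ht[h(x)+j] == x: found            */
            return pos;
    }
    return -1;                            /* back at h(x): table full, x absent */
}
```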

Linear Probing In Example 8.3 we saw that when linear probing is used to resolve overflows, identifiers tend to cluster together; moreover, adjacent clusters tend to coalesce, which increases search time. For example, to find ZA one must examine ht[25], ht[0], ..., ht[8], a total of 10 comparisons. To retrieve each identifier once, a total of 39 buckets are examined, an average of 3.25 buckets per identifier. Even though the average number of probes is small, the worst case can be quite large.

Quadratic Probing One of the problems with linear open addressing is that it tends to create clusters of identifiers, and these clusters tend to merge as more identifiers are entered, leading to bigger clusters. A quadratic probing scheme slows the growth of clusters: a quadratic function of i is used as the increment when searching through buckets. The search examines buckets h(x), (h(x) + i^2) % b, and (h(x) - i^2) % b for 1 <= i <= (b-1)/2.

Rehashing Another way to control the growth of clusters is to use a series of hash functions h_1, h_2, ..., h_m. This is called rehashing. The buckets h_i(x), 1 <= i <= m, are examined in that order.
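A small sketch of rehashing in C; integer keys and the EMPTY marker are illustrative choices of mine. The hash functions h_1, ..., h_m are supplied as an array and tried in order.

```c
#define EMPTY (-1)   /* marks an unused bucket in this illustrative integer table */

/* Try the buckets h[0](x), h[1](x), ..., h[m-1](x) in order and return the
 * first empty one, or -1 if all m hash functions lead to full buckets.     */
static int rehash_probe(const int table[], int x, int (*h[])(int), int m) {
    for (int i = 0; i < m; i++) {
        int pos = h[i](x);          /* each h[i] maps a key to a bucket index */
        if (table[pos] == EMPTY)
            return pos;
    }
    return -1;
}
```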

Open Addressing If a collision occurs, alternative cells are tried until an empty cell is found: H_i(K) = (h(K) + F(i)) % TS, where TS is the table size and F(0) = 0. F is the collision resolution strategy: in linear probing F is a linear function, in quadratic probing F is a quadratic function, and in double hashing F involves a second hash function.

Linear Probing H_i(K) = (h(K) + F(i)) % TS, with F(0) = 0 and F linear. Example: F(i) = i, so H_i(K) = (h(K) + i) % 10. Inserting 89, 18, 49, 58, 69 (with h(K) = K % 10) places 89 in cell 9 and 18 in cell 8, and then 49, 58, and 69 are pushed into cells 0, 1, and 2. Primary clustering: any key that hashes into the cluster requires several attempts to resolve the collision before it can be added, and it then enlarges the cluster.

Quadratic Probing H_i(K) = (h(K) + F(i)) % TS, with F(0) = 0 and F quadratic. Example: F(i) = i^2, so H_i(K) = (h(K) + i^2) % 10. Inserting the same keys 89, 18, 49, 58, 69 places 89 in cell 9 and 18 in cell 8, and then 49, 58, and 69 land in cells 0, 2, and 3, so the keys spread out more than with linear probing.
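Both probing schemes can be captured by one insertion routine parameterized on F, as sketched below in C. The function and variable names are mine; TS = 10 and the keys match the example above.

```c
#include <stdbool.h>

#define TS    10           /* TableSize in the example            */
#define FREE  (-1)         /* marks an empty cell (keys are >= 0) */

static int h(int k)           { return k % TS; }
static int f_linear(int i)    { return i; }        /* F(i) = i   */
static int f_quadratic(int i) { return i * i; }    /* F(i) = i^2 */

/* Insert key k using the probe sequence H_i(k) = (h(k) + F(i)) % TS.
 * Returns true on success, false if no empty cell is found within TS probes. */
static bool oa_insert(int table[], int k, int (*F)(int)) {
    for (int i = 0; i < TS; i++) {
        int pos = (h(k) + F(i)) % TS;
        if (table[pos] == FREE) {
            table[pos] = k;
            return true;
        }
    }
    return false;
}
```

Starting from a table of ten FREE cells, inserting 89, 18, 49, 58, 69 with f_linear fills cells 9, 8, 0, 1, 2, while f_quadratic fills cells 9, 8, 0, 2, 3, reproducing the two layouts sketched above.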

Chaining Linear probing performs poorly because the search for an identifier involves comparisons with identifiers that have different hash values; e.g., the search for ZA involves comparisons with buckets ht[0] through ht[7], none of which can possibly collide with ZA. These unnecessary comparisons can be avoided if all the synonyms are put in the same list, with one list per bucket. Each chain has a head node, and the head nodes are stored sequentially.

Hash Chain Example The identifiers of Example 8.3 stored as chains, one per bucket: ht[0] -> A4 -> A3 -> A1 -> A2 -> A, ht[3] -> D, ht[4] -> E, ht[6] -> G -> GA, ht[11] -> L, and ht[25] -> ZA -> Z. The average search length is (6*1 + 3*2 + 1*3 + 1*4 + 1*5)/12 = 2.
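A C sketch of chaining consistent with this example. The node layout is mine; an array of head pointers stands in for the sequentially stored head nodes, and the first-character hash assumes uppercase identifiers.

```c
#include <stdlib.h>
#include <string.h>

typedef struct node {
    const char  *key;
    struct node *next;
} node;

#define NBUCKETS 26
static node *head[NBUCKETS];                               /* one chain per bucket    */

static int h_first(const char *x) { return x[0] - 'A'; }   /* 'A' -> 0, ..., 'Z' -> 25 */

/* Insert x at the front of its bucket's chain (no duplicate check here). */
static void chain_insert(const char *x) {
    node *n = malloc(sizeof *n);
    if (n == NULL) return;                                 /* out of memory: give up  */
    n->key  = x;
    n->next = head[h_first(x)];
    head[h_first(x)] = n;
}

/* Only the chain of x's own bucket is searched, so every comparison is
 * against a synonym of x.                                               */
static node *chain_search(const char *x) {
    for (node *p = head[h_first(x)]; p != NULL; p = p->next)
        if (strcmp(p->key, x) == 0)
            return p;
    return NULL;
}
```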

Hash Functions Theoretically, the performance of a hash table depends only on the method used to handle overflows and is independent of the hash function, as long as a uniform hash function is used.

Theoretical Evaluation of Overflow Techniques In general, hashing provides better performance than conventional techniques such as balanced trees. However, the worst-case performance of hashing can be O(n).

Example Instance of Students Relation
sid    name   login        age  gpa
53666  Jones  jones@cs     18   3.4
53688  Smith  smith@eecs   18   3.2
53650  Chris  chris@math   19   3.8
Cardinality = 3, degree = 5, all rows distinct.

Record Formats: Fixed Length A record consists of fields F1, F2, F3, F4 with lengths L1, L2, L3, L4. Given the base address B of the record, the address of field F3 is B + L1 + L2. Information about field types is the same for all records in a file and is stored in the system catalogs.

Static Hashing The number of primary pages is fixed; they are allocated sequentially and never de-allocated, with overflow pages used if needed. h(k) mod M gives the bucket to which the data entry with key k belongs (M = number of buckets). In the accompanying figure, h(key) mod N selects one of the N primary bucket pages 0 through N-1, each of which may have a chain of overflow pages.

Dynamic Hashing Dynamic hashing retains fast retrieval time while allowing the file size to grow and shrink dynamically without penalty. Assume a file F is a collection of records R, each record having a key field K, and that records are stored in pages or buckets. The goal of dynamic hashing is to minimize the number of page accesses.

Dynamic Hashing Using Directories Consider the following identifiers and their binary representations (each identifier is two characters, each character coded in three bits: A = 100, B = 101, C = 110, digits by their binary value): A0 = 100 000, A1 = 100 001, B0 = 101 000, B1 = 101 001, C0 = 110 000, C1 = 110 001, C2 = 110 010, C3 = 110 011, C5 = 110 101. Now put these identifiers into a table of four pages. Each page can hold at most two identifiers, and each page is indexed by a two-bit sequence 00, 01, 10, 11. Place A0, B0, C2, A1, B1, and C3 in a binary tree, called a trie, which branches on the least significant bit at the root: one branch is taken when the bit is 0 and the other when it is 1. This is repeated with the next least significant bit at the next level.

A Trie To Hold Identifiers (a) A two-level trie on four pages holding {A0, B0}, {A1, B1}, {C2}, and {C3}. (b) Inserting C5 (110 101) overflows the {A1, B1} page, so a third bit is used to split it into {A1, B1} and {C5}. (c) Inserting C1 (110 001) overflows that page again, so a fourth bit splits it into {A1, C1} and {B1}.

Issues of Trie Representation From the example we find two major factors that affect the retrieval time: the access time for a page depends on the number of bits needed to distinguish the identifiers, and if the identifiers have a skewed distribution, the trie is skewed as well.

Extensible Hashing Fagin et al. present a method, called extensible hashing, for solving the above issues. A hash function is used to avoid skewed distributions; it takes the key and produces a random-looking set of binary digits. To avoid long searches down the trie, the trie is mapped to a directory, where a directory is a table of pointers. If k bits are needed to distinguish the identifiers, the directory has 2^k entries indexed 0, 1, ..., 2^k - 1, and each entry contains a pointer to a page.

Trie Collapsed Into Directories The figure shows the tries of the previous example collapsed into directories of (a) 2 bits (4 entries), (b) 3 bits (8 entries), and (c) 4 bits (16 entries). Several directory entries may point to the same page; the labels a through f mark the distinct pages, e.g. {A0, B0}, {A1, B1}, {C2}, {C3} under 2 bits, and {A0, B0}, {A1, C1}, {C2}, {C3}, {C5}, {B1} under 4 bits.

Example A directory-based (extendible) hash file with GLOBAL DEPTH 2: the DIRECTORY has four entries, each pointing to a data page. Bucket A (LOCAL DEPTH 2) holds 4*, 12*, 32*, 16*; Bucket B (LOCAL DEPTH 2) holds 1*, 5*, 21*, 13*; Bucket C (LOCAL DEPTH 2) holds 10*; Bucket D (LOCAL DEPTH 2) holds 15*, 7*, 19*. Data entries are distributed among the buckets according to the binary representation of their keys. Insert: if the bucket is full, split it (allocate a new page and re-distribute the entries).

Insert h(r) = 20 (Causes Doubling) 20* belongs in Bucket A, which is full, so Bucket A is split: 32* and 16* stay in Bucket A while 4*, 12*, and 20* go to a new Bucket A2, the `split image' of Bucket A, both pages now having LOCAL DEPTH 3. Because the local depth now exceeds the global depth, the DIRECTORY doubles from 4 to 8 entries and the GLOBAL DEPTH becomes 3; Buckets B (1*, 5*, 21*, 13*), C (10*), and D (15*, 7*, 19*) keep LOCAL DEPTH 2.

Hashing With Directory Using a directory to represent a trie allows the table of identifiers to grow and shrink dynamically. Accessing any page requires only two steps: first, use the hash function to find the address of the directory entry; second, retrieve the page associated with that address.
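In C, the two-step access might be sketched as follows. This is a sketch only: the directory layout and the use of the low-order global_depth bits of h(key) are assumptions consistent with the trie examples above, not code from the slides.

```c
#include <stdint.h>

typedef struct page page;        /* a bucket page holding up to s records    */

typedef struct {
    int    global_depth;         /* the directory has 2^global_depth entries */
    page **dir;                  /* array of 2^global_depth page pointers    */
} directory;

/* Step 1: hash the key and keep its low-order global_depth bits to form the
 * directory index.  Step 2: follow that entry's pointer to the data page.   */
static page *lookup_page(const directory *d, uint32_t key,
                         uint32_t (*h)(uint32_t)) {
    uint32_t index = h(key) & ((1u << d->global_depth) - 1u);
    return d->dir[index];
}
```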

Directoryless Dynamic Hashing Directoryless hashing (or linear hashing) assumes a continuous address space in memory large enough to hold all the records, so a directory is not needed. The hash function must therefore produce the actual address of the page containing the key. In contrast to the directory scheme, in which a single page might be pointed at by several directory entries, in the directoryless scheme one unique page is assigned to every possible address.
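The slide does not spell out how that address is produced; one standard scheme for linear hashing, assumed here purely for illustration, keeps a level r (there were 2^r pages at the start of the current expansion round) and a pointer next to the next page to be split, and computes the page address like this:

```c
#include <stdint.h>

/* Directoryless (linear) hashing address computation -- a common scheme,
 * assumed for illustration.  Pages 0 .. next-1 have already been split in
 * this round, so keys falling there are addressed with one more bit.      */
static uint32_t linear_hash_addr(uint32_t hashed_key, unsigned level, uint32_t next) {
    uint32_t addr = hashed_key & (uint32_t)((1ull << level) - 1);       /* h mod 2^level      */
    if (addr < next)                                                    /* page already split */
        addr = hashed_key & (uint32_t)((1ull << (level + 1)) - 1);      /* h mod 2^(level+1)  */
    return addr;
}
```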

A Trie Mapped To Directoryless, Continuous Storage The two-level trie with pages {A0, B0}, {C2}, {A1, B1}, {C3} is mapped onto four contiguous pages of two slots each; the pages holding C2 and C3 each have one empty slot.

Directoryless Scheme Overflow Handling The figure shows the four contiguous pages ({A0, B0}, {C2, -}, {A1, B1}, {C3, -}) at successive stages of insertion: when an identifier such as C5 is inserted into a full page, a new page is added at the end of the contiguous storage and the record that does not fit is kept in an overflow page; a later insertion (C1) again adds a new page and an overflow page.