Tables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries

Similar documents
Chapter 20 Hash Tables

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Structures, Algorithm Analysis: CHAPTER 5: HASHING

Topic HashTable and Table ADT

Section 1: True / False (1 point each, 15 pts total)

Hash Table and Hashing

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

DATA STRUCTURES AND ALGORITHMS

COMP171. Hashing.

Dynamic Dictionaries. Operations: create insert find remove max/ min write out in sorted order. Only defined for object classes that are Comparable

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

Hash Tables. Gunnar Gotshalks. Maps 1

Lecture 16 More on Hashing Collision Resolution

Algorithms and Data Structures

Hashing Techniques. Material based on slides by George Bebis

Cpt S 223. School of EECS, WSU

HASH TABLES. Hash Tables Page 1

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

CS 310 Advanced Data Structures and Algorithms

AAL 217: DATA STRUCTURES

Hash Tables. Johns Hopkins Department of Computer Science Course : Data Structures, Professor: Greg Hager

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Hash Tables. Hashing Probing Separate Chaining Hash Function

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

CSE 214 Computer Science II Searching

Hash[ string key ] ==> integer value

Lecture 10 March 4. Goals: hashing. hash functions. closed hashing. application of hashing

Open Addressing: Linear Probing (cont.)

Hash Tables and Hash Functions

HASH TABLES. Goal is to store elements k,v at index i = h k

Lecture 18. Collision Resolution

Hash table basics mod 83 ate. ate

Hash table basics. ate à. à à mod à 83

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved.

Data and File Structures Laboratory

Hash-Based Indexing 1

CS2210 Data Structures and Algorithms

Hash table basics mod 83 ate. ate. hashcode()

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

Hashing. Hashing Procedures

Unit #5: Hash Functions and the Pigeonhole Principle

Hashing. October 19, CMPE 250 Hashing October 19, / 25

Standard ADTs. Lecture 19 CS2110 Summer 2009

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Symbol Tables. ASU Textbook Chapter 7.6, 6.5 and 6.3. Tsan-sheng Hsu.

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Dictionaries and Hash Tables

Lecture 4. Hashing Methods

HASH TABLES cs2420 Introduction to Algorithms and Data Structures Spring 2015

CSE 373 Autumn 2012: Midterm #2 (closed book, closed notes, NO calculators allowed)

Fast Lookup: Hash tables

Hash table basics mod 83 ate. ate

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Understand how to deal with collisions

Hashing. Biostatistics 615 Lecture 12

Introduction to Hashing

Worst-case running time for RANDOMIZED-SELECT

CS 3410 Ch 20 Hash Tables

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

HASH TABLES.

Hashing. mapping data to tables hashing integers and strings. aclasshash_table inserting and locating strings. inserting and locating strings

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

Hash Tables Hash Tables Goodrich, Tamassia

DATA STRUCTURES/UNIT 3

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

Hashing Algorithms. Hash functions Separate Chaining Linear Probing Double Hashing

stacks operation array/vector linked list push amortized O(1) Θ(1) pop Θ(1) Θ(1) top Θ(1) Θ(1) isempty Θ(1) Θ(1)

UNIT III BALANCED SEARCH TREES AND INDEXING

Hashing (Κατακερματισμός)

Computer Science 136 Spring 2004 Professor Bruce. Final Examination May 19, 2004

Priority Queue. 03/09/04 Lecture 17 1

Fundamental Algorithms

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

UNIT III BALANCED SEARCH TREES AND INDEXING

TABLES AND HASHING. Chapter 13

Dictionaries and Hash Tables

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

9/24/ Hash functions

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS

Algorithms and Data Structures

Hash Tables. Hash functions Open addressing. November 24, 2017 Hassan Khosravi / Geoffrey Tien 1

Abstract Data Types (ADTs) Queues & Priority Queues. Sets. Dictionaries. Stacks 6/15/2011

CSI33 Data Structures

Design Patterns for Data Structures. Chapter 9.5. Hash Tables

Overview. Department of Computer Science UHD

Chapter 27 Hashing. Objectives

Hashed-Based Indexing

STRUKTUR DATA. By : Sri Rezeki Candra Nursari 2 SKS

Data Structures And Algorithms

Transcription:

1: Tables

Tables The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries Symbol Tables Associative Arrays (eg in awk, perl) Tables 1-1

Tables (continued) Associated with each key is some data The operations required for this ADT are Create a new Table Insert a key, and associated data, into a Table. Delete a key, and associated data, from a Table. Retrieve the data associated with a key from a Table. Tables 1-2

Tables (continued) The Table is essentially a set of (key,data) pairs, with the property that, for any k,d,d, if (k,d) and (k,d ) are in the set, then d = d. In other words a Table provides a function Key Data. Tables 1-3

Implementing Tables Tables can be implemented in many ways Linked lists of (key,data) pairs Tree based representations Ordered binary trees Balanced trees, eg 2-3 trees, Red-Black trees, B-trees, Splay trees See lab exercise for simplified version, with no data associated with keys. Hash tables Tables 1-4

Hash tables When storing and accessing indexed information, most efficient access mechanism is direct addressing. Simplest example is an array a Can access, or update, any array element a[i] in constant time, if we know the index i. If data is stored in other forms of data structure, such as lists, or binary search trees, the access and update cost depends on position of the data in the data structure. Arrays ok for data indexed by integers Need similar mechanisms for data with other types of index, e.g. strings of characters Tables 1-5

Suppose we are given a set K of keys. Hash tables (continued) For direct addressing, need a function h, such that, for any k K h(k) is the address of the location of data associated with k K k * h memory h needs to have the following properties It is easy (and fast) to calculate h(k), given k It is 1-1 (i.e. no two keys give the same value of h) Uses reasonable amount of memory (i.e. the range of h is not too large) All these properties are satisfied by array indexing, but in general are too much to ask for. Tables 1-6

Hash tables (continued) Hash tables give a way round the problem by relaxing the 1-1 condition. A hash table is basically an array K k * h 0 2 4 6 HashSize - 1 A hash function is a function h : K (0...HashSize 1) h is used whenever access to the hash table is required e.g. search, Tables 1-7

Hash tables (continued) insertion, deletion A problem arises when two keys k 1,k 2 have h(k 1 ) = h(k 2 ) Where does data associated with these keys go? Two mechanisms for resolving this type of conflict Open hashing (otherwise known as separate chaining) Closed hashing (otherwise known as open addressing) Tables 1-8

Open hashing No data is stored in the hash table itself Hash table just contains pointers to linked lists of data cells K k1 k2 k3 k3 * k2 * Tables 1-9

Open hashing (continued) First cells are often dummies, containing no data (as in the above diagram). This simplifies code for insertion and deletion Have drawn cells containing only a key and a pointer. The cells could, of course contain additional data associated with the key and usually do Tables 1-10

Need following functions to manipulate hash tables Initialise a hash table, given its size Create array (of appropriate size) of pointers Create header cells for each array entry Open hashing (continued) Find a key k Calculate h(k) look along the list with header at h(k) to see if k is there If it is, return pointer to cell containing k, otherwise return a pointer to the cell after which k should be inserted. Insert a key k If already there, do nothing Otherwise, insert a new cell, containing k in the list starting at h(k) Delete a key k Tables 1-11

Open hashing (continued) If not there, do nothing Otherwise, delete cell containing k from the list starting at h(k), releasing any memory used by cell. Tables 1-12

/* The type for keys */ typedef char * Key_Type; Data types for open hashing /* Type for size of hash table */ typedef unsigned int Hash_size; /* Type for index into hash table */ typedef unsigned int Index; Tables 1-13

Data types for open hashing (continued) /* Type for cells containing keys */ /* No auxiliary data in this example */ typedef struct data_cell *cell_ptr; struct data_cell { Key_Type element; cell_ptr next; }; /* Type for the list of cell pointers */ typedef cell_ptr List; /* Type for a pointer to an individual cell */ typedef cell_ptr Position; Tables 1-14

Data types for open hashing (continued) /* The underlying hash table stucture */ struct hash_tbl { Hash_size table_size; List *the_lists; /* this will be an array of lists, allocated later */ /* The lists will use headers, allocated later */ }; /* The hash table */ /* - a pointer to the hash table stucture */ typedef struct hash_tbl *Hash_Table; Tables 1-15

/* A hash function */ Function headers for open hashing Index hash( Key_Type key, Hash_size table_size ); /* Initialise a table of given size */ Hash_Table initialize_table( Hash_size table_size ); /* Find a key in a hash table */ Position find( Key_Type key, Hash_Table H ); /* Insert a key in a hash table */ void insert( Key_Type key, Hash_Table H ); /* Delete a key from a hash table */ void delete( Key_Type key, Hash_Table H ); Tables 1-16

Cost of hashing Search time for a key k has two components Calculating h(k) Searching the linked list for k Ignoring, for time being, calculation of h(k), search time depends on Maximum length of linked lists = Maximum number of collisions for any k Aim to choose size of hash table and hash function so that this is as small as possible Tables 1-17

Hash functions If hash table size is H, then need a function Should have properties It is easy to compute h : K 0...(H 1) For given values in K, values of h(k) are uniformly distributed over the range 0...(H 1) For example, no use having H = 1000 and all values of h(k) in the region 200 to 300. These properties are difficult to achieve, since the keys themselves are often not randomly distributed Tables 1-18

Example hash functions Simple additive function. Adds character values (as ints) and reduces mod the table size. Index hash1( Key_Type key, Hash_size table_size ) { unsigned int hash_val=0; while( *key!= \0 ) hash_val += *key++; } return( hash_val % table_size ); This function takes no account of the position of characters within the key e.g. cat and act hash to the same value. Tables 1-19

Example hash functions (continued) This function weights the second and third characters with powers of the size of the alphabet (in this case 27). It ignores all characters after the third. Index hash2( Key_Type key, Hash_size H_SIZE ) { int hash_val; if ( key[0] == \0 ) hash_val = 0; else if ( key[1] == \0 ) hash_val = key[0] % H_SIZE ; else if ( key[2] == \0 ) hash_val = (key[0] + 27*key[1]) % H_SIZE; else hash_val = (key[0] + 27*key[1] + 729*key[2]) % H_SIZE; return(hash_val); } Tables 1-20

Example hash functions (continued) For a key of length n, this function calculates the polynomial c 0 32 n 1 + c 1 32 n 2 +...c n 1 mod hashsize Index hash3( Key_Type key, Hash_size table_size ) { unsigned int hash_val=0; while( *key!= \0 ) hash_val = ( hash_val << 5 ) + *key++; } return( hash_val % table_size ); Tables 1-21

Example hash functions (continued) As hash3 except we avoid possible overflows by reducing mod table_size each time. Index hash4( Key_Type key, Hash_size table_size ) { unsigned int hash_val=0; while( *key!= \0 ) hash_val = ((hash_val << 5) + *key++) % table_size ; } return(hash_val); Tables 1-22

Hash function performance The following slides show the performance of the above hash functions under two sets of conditions Data set of 315 words, hash table size 157 Data set of 25144 words (/usr/dict/words), hash table size 3001 Tables 1-23

Hash function performance (continued) Chain lengths: sample - size 157 - function 1 50 45 40 35 Frequency 30 25 20 15 10 5 0 1 2 3 4 5 Chain length Tables 1-24

Hash function performance (continued) Chain lengths: sample - size 157 - function 2 55 50 45 40 Frequency 35 30 25 20 15 10 5 0 1 2 3 4 5 6 Chain length Tables 1-25

Hash function performance (continued) Chain lengths: sample - size 157 - function 3 45 40 35 30 Frequency 25 20 15 10 5 0 0 1 2 3 4 5 6 Chain length Tables 1-26

Hash function performance (continued) Chain lengths: sample - size 157 - function 4 50 45 40 35 Frequency 30 25 20 15 10 5 0 0 1 2 3 4 5 Chain length Tables 1-27

Hash function performance (continued) Chain lengths: words - size 3001 - function 1 1800 1600 1400 1200 Frequency 1000 800 600 400 200 0 0 20 40 60 80 100 120 Chain length Tables 1-28

Hash function performance (continued) Chain lengths: words - size 3001 - function 2 1200 1000 800 Frequency 600 400 200 0 0 50 100150200250300350400 Chain length Tables 1-29

Hash function performance (continued) Chain lengths: words - size 3001 - function 3 400 350 300 250 Frequency 200 150 100 50 0 0 5 10 15 20 25 30 35 40 45 50 Chain length Tables 1-30

Hash function performance (continued) Chain lengths: words - size 3001 - function 4 450 400 350 300 Frequency 250 200 150 100 50 0 0 5 10 15 20 25 30 35 40 45 50 Chain length Tables 1-31

Closed Hashing (Open addressing) Open hashing has several disadvantages uses pointers, so have time cost of dynamic memory allocation needs separate data structure for chains, and code to manage it Closed hashing has neither of these problems all data is stored in the hash table array itself when collision occurs, look elsewhere in the array for space Tables 1-32

Closed Hashing (Open addressing) (continued) Given key k, hash function hash, adopt following procedure to search for k. Look at the location s 0 given by hash(k) If k found at location s 0, return s 0 Otherwise, if location s 0 is empty, halt search, since k is not in the table Otherwise, use second function p to find s 1, given by s 1 = hash(k) + p(1) (mod HashSize) Repeat above with s i = hash(k) + p(i) (mod HashSize) until either find k or an empty cell. Tables 1-33

Closed Hashing (Open addressing) (continued) The function p called collision resolution function or probe function Simplest function is p(i) = i So s i = hash(k) + i Start at s 0 and look at consecutive cells until find space. This is linear probing Tables 1-34

Hash table insertion linear probing This example uses integer keys and the hash function h(k) = k mod hash_size Data to be inserted is 23 3 2 22 45 68 24 47 91 Tables 1-35

Using linear probing gives insertions ---> 23 3 2 22 45 68 24 47 91 ============================================================ 0 23 23 23 23 23 23 23 23 23 1 45 45 45 45 45 ˆ 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 68 68 68 68 H 5 24 24 24 A 6 47 47 S 7 91 H 8 9 T 10 A 11 B 12 L 13 E 14 15 16 17 v 18 19 20 21 22 22 22 22 22 22 22 ============================================================ Hash table insertion linear probing (continued) Everything OK until try to insert 45. 45 mod 23 = 22 full 22 + 1 mod 23 = 0 full 22 + 2 mod 23 = 1 insert Tables 1-36

Hash table insertion linear probing (continued) Note that this produces a block of occupied cells this effect is known as clustering. Using linear probing means that if a new key is hashed into the cluster, need several probes to find space for it, and it then adds to the cluster. One way to avoid this is to use quadratic probing. p(i) = i 2 Tables 1-37

Hash table insertion quadratic probing Same data and hash function as before, but using quadratic probing ============================================================ 0 23 23 23 23 23 23 23 23 23 1 24 24 24 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 Again everything OK until try 4 5 47 47 6 7 8 45 45 45 45 45 9 10 11 12 91 13 14 15 68 68 68 68 16 17 18 19 20 21 22 22 22 22 22 22 22 ============================================================ to insert 45. 45 mod 23 = 22 full (22 + 1 2 ) mod 23 = 0 full (22 + 2 2 ) mod 23 = 3 full (22 + 3 2 ) mod 23 = 8 insert Tables 1-38

Quadratic probing Linear probing is guaranteed to find an empty space if one exists. Not the case for quadratic probing Consider the following hash table. It isn t full, but an attempt to insert 30, using hash function as before and quadratic probing, will fail. Tables 1-39

=========== 0 10 1 21 2 3 4 34 5 45 6 56 7 8 9 69 =========== 30 mod 10 = 0 full (0 + 1 2 ) mod 10 = 1 full (0 + 2 2 ) mod 10 = 4 full (0 + 3 2 ) mod 10 = 9 full (0 + 4 2 ) mod 10 = 6 full (0 + 5 2 ) mod 10 = 5 full (0 + 6 2 ) mod 10 = 6 full (0 + 7 2 ) mod 10 = 9 full Never get any values other than 0, 4, 5, 6, 9, so never hit free space Tables 1-40

Quadratic probing (continued) Can be shown that can guarantee to find a place for a new key k using quadratic probing if following conditions hold The table size is prime The table is no more than half full This means that, if quadratic probing used, need table size to be at least twice as big as maximum number of data items Tables 1-41

Double hashing Another way of reducing the clustering problem, without the disadvantages of quadratic probing is double hashing. Choose a second hash function, h 2, and use the collision resolution function p(i) = h 2 (k) i Need to make sure that h 2 (k) never takes the value 0 (Why?) If the table size is prime and p is some prime smaller than the hash table size, the function h 2 (k) = p (k mod p) usually works well. (Note that we are assuming integer keys here, but can easily be generalised to string keys) Tables 1-42

Double hashing (continued) Using double hashing with the above example, with secondary hash function p (k mod p) with p = 7 Can insert 10, 21, 34, 45, 56, 69 with no problems now try to insert 70 and 19 Tables 1-43

Double hashing (continued) ======================== 0 10 10 10 1 21 21 21 2 3 19 4 34 34 34 5 45 45 45 6 56 56 56 7 70 70 8 9 69 69 69 ======================== h 2 (70) = 7 (70 mod 7) = 7 70 mod 10 = 0 full (0 + 7) mod 10 = 7 insert h 2 (19) = 7 (19 mod 7) = 2 19 mod 10 = 9 full (9 + 2) mod 10 = 1 full (9 + 2 2) mod 10 = 3 insert Tables 1-44

Data types for closed hashing /* Type for size of hash table */ typedef unsigned int Hash_size; /* Type for index into hash table */ typedef unsigned int Index; /* Type for state of hash table entry */ enum kind_of_entry { legitimate, empty, deleted }; /* Type for hash table entry */ struct hash_entry { Key_Type element; enum kind_of_entry state; }; Tables 1-45

typedef struct hash_entry cell; /* The underlying hash table stucture */ struct hash_tbl { unsigned int table_size; cell *the_cells; /* an array of hash_entry cells, * allocated during initialisation */ }; /* The hash table */ /* - a pointer to the hash table stucture */ typedef struct hash_tbl *Hash_Table; Tables 1-46

Rehashing Two major disadvantages to closed hashing Need to know upper bound on amount of data before building the hash table As the hash table fills up, more clashes occur and insertion becomes more expensive. At some stage insertion will fail (at best when the table is full). Rather than make the hash table extremely large in first place, can get round both problems by rehashing. This involves building a new hash table double the size of original, inserting all data from the original hash table into the new table, and freeing memory resources taken by old table. Tables 1-47

Rehashing (continued) This means that we are not limited by size of data expected in advance, Can rehash if run out of space. Rehash can take place either When becomes full probably a bit late When table loading reaches some predetermined factor (70% often used) Rehashing should be triggered automatically by an insert, without user intervention. Easy to implement Tables 1-48