CPSC 259 admin notes

Similar documents
Unit #5: Hash Functions and the Pigeonhole Principle

Hash Tables. Hash functions Open addressing. November 24, 2017 Hassan Khosravi / Geoffrey Tien 1

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

AAL 217: DATA STRUCTURES

COMP171. Hashing.

Open Addressing: Linear Probing (cont.)

TABLES AND HASHING. Chapter 13

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

CMSC 341 Hashing. Based on slides from previous iterations of this course

Hash Tables. Hash functions Open addressing. March 07, 2018 Cinda Heeren / Geoffrey Tien 1

Algorithms and Data Structures

Understand how to deal with collisions

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

Lecture 16 More on Hashing Collision Resolution

Hashing. Hashing Procedures

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

Hash table basics. ate à. à à mod à 83

Hash Table and Hashing

Lecture 18. Collision Resolution

Tirgul 7. Hash Tables. In a hash table, we allocate an array of size m, which is much smaller than U (the set of keys).

CS/COE 1501

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Hashing Techniques. Material based on slides by George Bebis

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

Hash Tables. Hashing Probing Separate Chaining Hash Function

Hash table basics mod 83 ate. ate. hashcode()

Hash table basics mod 83 ate. ate

CSI33 Data Structures

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

Question Bank Subject: Advanced Data Structures Class: SE Computer

Adapted By Manik Hosen

Introduction to Hashing

Data and File Structures Chapter 11. Hashing

Comp 335 File Structures. Hashing

Hash table basics mod 83 ate. ate

Dynamic Dictionaries. Operations: create insert find remove max/ min write out in sorted order. Only defined for object classes that are Comparable

Hash Tables. Gunnar Gotshalks. Maps 1

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

HASH TABLES. Goal is to store elements k,v at index i = h k

CSE 326: Data Structures Lecture #11 B-Trees

Outline. Computer Science 331. Desirable Properties of Hash Functions. What is a Hash Function? Hash Functions. Mike Jacobson.

CS 310 Hash Tables, Page 1. Hash Tables. CS 310 Hash Tables, Page 2

Hashing for searching

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

CSE 214 Computer Science II Searching

HASH TABLES. Hash Tables Page 1

Hashing 1. Searching Lists

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Lecture 7: Efficient Collections via Hashing

Tables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries

CSCI 104 Hash Tables & Functions. Mark Redekopp David Kempe

BBM371& Data*Management. Lecture 6: Hash Tables

Hash Tables and Hash Functions

Cpt S 223. School of EECS, WSU

HASH TABLES cs2420 Introduction to Algorithms and Data Structures Spring 2015

More on Hashing: Collisions. See Chapter 20 of the text.

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

HASH TABLES.

Hashing HASHING HOW? Ordered Operations. Typical Hash Function. Why not discard other data structures?

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

Fast Lookup: Hash tables

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

CS 241 Analysis of Algorithms

Topic HashTable and Table ADT

CS 310 Advanced Data Structures and Algorithms

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Hash[ string key ] ==> integer value

Fundamental Algorithms

Dictionaries and Hash Tables

Chapter 20 Hash Tables

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Hash Tables: Handling Collisions

9/24/ Hash functions

Computer Science Foundation Exam

CPSC 211, Sections : Data Structures and Implementations, Honors Final Exam May 4, 2001

Data Structures. Topic #6

CSE 332: Data Structures & Parallelism Lecture 10:Hashing. Ruth Anderson Autumn 2018

Data Structures & File Management

Data Structures And Algorithms

Introduction To Hashing

CSE 2320 Notes 13: Hashing

CSC 321: Data Structures. Fall 2016

Hashing. October 19, CMPE 250 Hashing October 19, / 25

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

CSCE 2014 Final Exam Spring Version A

1. Attempt any three of the following: 15

Symbol Tables. ASU Textbook Chapter 7.6, 6.5 and 6.3. Tsan-sheng Hsu.

SFU CMPT Lecture: Week 8

CSC 261/461 Database Systems Lecture 17. Fall 2017

STRUKTUR DATA. By : Sri Rezeki Candra Nursari 2 SKS

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Successful vs. Unsuccessful

Algorithms and Data Structures

Transcription:

CPSC 9 admin notes! TAs Office hours next week! Monday during LA 9 - Pearl! Monday during LB Andrew! Monday during LF Marika! Monday during LE Angad! Tuesday during LH 9 Giorgio! Tuesday during LG - Pearl! Friday during LC 9 - Pearl! Friday during LD Lawrence! Online quizzes are due next Friday Nov 8 CPSC 9 Hashing Page

CPSC 9: Data Structures and Algorithms for Electrical Engineers Hashing Textbook Reference: Thareja first edition: Chapter, pages -7 Thareja second edition: Chapter, pages -88 Instructor: Dr. Hassan Khosravi September-December CPSC 9 Hashing Page

Learning Goals Provide examples of the types of problems that can benefit from a hash data structure. Compare and contrast open addressing and chaining. Evaluate collision resolution policies. Describe the conditions under which hashing can degenerate from O() expected complexity to O(n). Determine the characteristics of a good hash function (e.g., uniform distribution, fast to compute, appropriate table size). Determine how well a hashing solution scales. Explain how to solve scalability problems. CPSC 9 Hashing Page

weeks journey filled with data structures, complexity, and algorithms! We have learned! Pointers, struct, and dynamic memory allocation.! Computing the complexity of our algorithms! Linked lists, Stack and Queue! Recursion and applications! Binary trees and Binary search trees! Binary heaps and heapsorts! Sorting CPSC 9 Hashing Page

Hashing A data structure that allows O() search, insert and delete Biometrics and Identity Management Digital signatures CPSC 9 Public-Key Infrastructure Hashing Page

CPSC 9 Hashing Page

Hashing CPSC 9 Hashing Page 7

Introductory Examples! We have studied two search algorithms! Linear search has running time proportional to O(n)! Binary search takes time proportional to O(log n)! Can we do better?! Example (natural, numeric keys): In a small company of employees, each employee is assigned an Emp_ID number in the range 99. To store the employee s records in an array, each employee s Emp_ID number acts as an index into the array where this employee s record will be stored as shown in figure KEY ARRAY OF EMPLOYEE S RECORD Key [] Record of employee having Emp_ID Key [] Record of employee having Emp_ID.. Key 99 [99] Record of employee having Emp_ID 99 CPSC 9 Hashing Page 8

Introductory Examples! Let s assume that the same company uses a five digit Emp_ID number as the primary key. In this case, key values will range from to 99999. If we want to use the same technique as above, we will need an array of size,, of which only elements will be used. KEY ARRAY OF EMPLOYEE S RECORD Key [] Record of employee having Emp_ID. Key n [n] Record of employee having Emp_ID n.. Key 99999 [99999] Record of employee having Emp_ID 99999! It is impractical to waste that much storage just to ensure that each employee s record is in a unique and predictable location. CPSC 9 Hashing Page 9

Introductory Examples! Another possible option is to use just the last two digits of key to identify each employee.! Employee with Emp_ID number 799 is stored at index 9.! Employee with Emp_ID number is stored at index.! Would this be a good idea? Why? Why not? CPSC 9 Hashing Page

Example (characters): Suppose we want to design a frequency table for counting the number of times that each character appears in a file. We can use the character s ASCII code (8 bits = possibilities) as the key to index into an array of frequencies (the values): e.g., A = a = 97; = 8 97 In this example, our array of frequencies will have size : one for each of the different ASCII characters. It may well be that not all of these different characters are in a given file, and so we are potentially wasting memory. What we gain is the benefit of accessing frequencies in O() time. Is this a good time/space trade-off? Suppose we make the call: int freqa = getfrequency( A ); Since the ASCII code for A has the numeric value, freqa is set to the frequency found at array index. CPSC 9 Hashing Page

Example (long numeric keys): Suppose we want to keep track of Visa credit card accounts. Visa card numbers are long: digits allowing for a total of different numbers. If we were to build an array capable of holding accounts, we would have way more accounts than there are people on Earth! (We couldn t hold the entire table in memory, either.) It is not reasonable to use the Visa credit card number to index into an array. We ll see how to solve this problem momentarily. Example (strings): What if the key is a string? For example, social insurance numbers in the United Kingdom are alphanumeric. We cannot use a non-numeric string to index into an array. Solution: Convert String to int? Hashing solves these problems. How? We define a function that transforms keys into a numeric array index. Such a function is called a hash function. CPSC 9 Hashing Page

Hash function! Hash Function, h is simply a mathematical formula which when applied to the key, produces an integer which can be used as an index for the key in the hash table! The goal of a hash function is to produce a unique set of integers within some suitable range to disperse the keys in an apparently random, uniform way. Why? Avoid clusters (collisions) O() access! As much as we would like to produce unique keys, it is not always possible, so we might have two or more keys that map to the same memory location. This is called collision. CPSC 9 Hashing Page

Hash Table Terminology hash function Alan Steve Hassan Kendra Ed f(x) collision keys load factor λ = # of entries in table tablesize CPSC 9 Hashing Page

A Good Hash Function! Is easy (fast) to compute; O()! Distributes the data evenly (hash(a) hash(b), as much as possible).! Uses the whole hash table! (for all k < size, there s an i such that hash(i) % size = k). CPSC 9 Hashing Page

Strategies for Building Hash Functions Truncation: Use only part of the key as the hash index. For example, if we want to build a table of student records at UBC, then rather than using the entire student number (8 digits) as the key, we could use only the first digits. We would therefore require an array with space for up to student records. Note that some students may have the first digits of their student number in common, and so collisions would occur. Folding: Partition the key into parts and combine the parts using arithmetic operations. For example, we could take a Visa number and partition it into one digit number, plus two digit numbers. We could then add these three numbers together to come up with a hash index. CPSC 9 Hashing Page

Strategies for Building Hash Functions Modular Arithmetic: Use the modulus (remainder) operator (% in C) to produce a hash index in a given range. % = For example, suppose a company with employees wants to use each employee s Social Insurance Number as a key. We might use an array (hash table) of size, thereby leaving room for expansion. We can then generate a hash index from the Social Insurance Number as follows: hashindex = SIN % ; Again, we would expect this hash function to lead to collisions. 7 78 99 % = 99 99 99 CPSC 9 Hashing Page 7

Alphanumeric Keys Suppose we have a table capable of holding records, and whose keys consist of strings that are characters long. We can apply numeric operations to the ASCII codes of the characters in the string in order to determine a hash index: int hash(char * key){! int hashcode = ;! int index = ;! while (key[index]!= \ ){! hashcode += (int)key[index];! index++;! }! return hashcode % ;! }! C A M E C O / C = 7 A = M = 77 E = 9 C = 7 O = 79 = CAMECO 999 CPSC 9 Hashing Page 8

Alphanumeric Keys What is a significant problem with this approach? Hash of any string with the same six letters is the same ASCII values have a max of * =, which means [ 999] are wasted A solution: hashcode = ;! while (key[index]!= \ ){! hashcode = *hashcode + key[index];! index++;! } ((((((7) + )+ 77)+9)+7) + 79) % 999= 89 CPSC 9 Hashing Page 9

Collisions If a hash function h maps two different keys x and y to the same index (i.e., x y and h(x) = h(y)), then x and y collide. A perfect hash function causes no collisions. Unfortunately, creating a perfect hash function requires knowledge of what keys will be hashed. Even a hash function that distributes items randomly (why is this not a practical hash function?) will cause collisions even when the number of items hashed is small. To see why, consider the birthday paradox from statistics: The probability of finding or more people that share the same birthday, in a room of n people, is > % if n. So, we must design a collision handling scheme. CPSC 9 Hashing Page

The Pigeonhole Principle (informal) You can t put k+ pigeons into k holes without putting two pigeons in the same hole. Image by en:user:mckay, used under CC attr/share-alike. This place just isn t coo anymore. Pigeonhole principle says we can t avoid all collisions try to hash without collision m keys into n slots with m > n try to put pigeons into holes CPSC 9 Hashing Page

Clicker question Suppose we have colours of Halloween candy, and that there s lots of candy in a bag. How many pieces of candy do we have to pull out of the bag if we want to be sure to get of the same colour? a. b. c. d. 8 e. None of these CPSC 9 Hashing Page

Clicker question (answer) Suppose we have colours of Halloween candy, and that there s lots of candy in a bag. How many pieces of candy do we have to pull out of the bag if we want to be sure to get of the same colour? a. b. c. d. 8 e. None of these CPSC 9 Hashing Page

Collision Resolution! Pigeonhole principle says we can t avoid all collisions! try to hash without collision m keys into n slots with m > n! try to put pigeons into holes! What do we do when two keys hash to the same entry?! chaining: put a linked list in each entry shove extra pigeons in one hole!! open addressing: pick a next entry to try CPSC 9 Hashing Page

Hashing Method #: Chaining Each table location points to a linked list (chain) of items that hash to this location. When we search for an item, we apply the hash function to get its hash index and then perform a linear search of the linked list. If the item is not in the list, then it s not in the table. Properties λ can be greater than performance degrades with length of chains a e c load factor λ = h(a) = h(d) h(e) = h(b) d b # of entries in table tablesize CPSC 9 Hashing Page

In-class exercise Example: Suppose h(x) = x/ mod Hash:, 88, 9,, 99, 9, 9 9 88 99 Example: find node with key CPSC 9 Hashing Page

Deleting when using chaining Example: Suppose h(x) = x/ mod Hash:, 88,, 99, 88 99 Delete Remove from the linked list 88 99 CPSC 9 Hashing Page 7

Pros and cons of chaining Advantages of Chaining: The size s of the hash table can be smaller than the number of items n hashed. Why is this often a good thing? Fewer blank/wasted cells (especially in the case where the number of cells greatly exceeds the number of keys). Collision handling is easy: O(). Can accommodate overflows Disadvantages of Chaining: Search time can become O(n) due to long chains. Overhead of pointers: extra space, extra work. CPSC 9 Hashing Page 8

CPSC 9 admin notes! TAs Office hours for today! Friday during LC 9 - Pearl! Friday during LD Lawrence! My office hour! Monday -! Online quizzes are due tonight :9pm! Participation (clicker) grades are now available on Connect! %, Check your grade. I still have active clicker that has not been registered.! Final CPSC9 DEC, : PM OSBO A! Other than a calculator, no electronic equipment is allowed. You may not use your cell phone as a calculator. CPSC 9 Hashing Page 9

Hashing Method #: Open Addressing There is no linked list here. We have a hash table containing key-value pairs, and we handle collisions by trying a sequence of table entries until either the item is found in the table or we reach an empty cell. Properties λ performance degrades with difficulty of finding right spot h(a) = h(d) h(e) = h(b) a d e b c load factor λ = # of entries in table tablesize CPSC 9 Hashing Page

Probing Probing sequence is: h(k) mod size If h(k) is occupied, try (h(k) + f()) mod size If (h(k) + f()) mod size is occupied, try (h(k) + f()) mod size And so forth Probing properties the i th probe is to (h(k) + f(i)) mod size where f() = if i reaches size, the insert has failed depending on f(), the insert may fail sooner long sequences of probes are costly! CPSC 9 Hashing Page

Probe sequence is h(k) mod size (h(k) + ) mod size (h(k) + ) mod size Linear probing f(i) = i For any λ <, linear probing will find an empty slot CPSC 9 Hashing Page

In-class exercise Using the hash function h(x) = x % 7 insert the following values using linear probing: 7, 9,, 7,, insert(7) 7%7 = insert(9) 9%7 = insert() %7 = insert(7) 7%7 = insert() %7 = insert() %7 = 7 7 7 9 9 9 9 9 7 7 7 7 7 7 probes: CPSC 9 Hashing Page

Problems with linear probing! The algorithm results in clustering! Where there has been one collision there will be more! Clusters of consecutive cells are formed and the time required for a search increases with the size of the cluster.! Also, when a new value has to be inserted in to the table at a position which is already occupied, that value is inserted at the end of the cluster, which all the more increases the length of the cluster.! This phenomenon is called primary clustering. To avoid primary clustering, other techniques like quadratic probing and double hashing are used. CPSC 9 Hashing Page

Quadratic Probing f(i) = i Probe sequence is h(k) mod size (h(k) + ) mod size (h(k) + ) mod size (h(k) + 9) mod size CPSC 9 Hashing Page

Quadratic Probing Example Using the hash function h(x) = x % 7 insert the following values using quadratic probing: 7,, 8,, insert(7) 7%7 = insert() %7 = insert(8) 8%7 = insert() %7 = insert() %7 = 8 8 8 7 7 7 7 7 probes: CPSC 9 Hashing Page

Quadratic Probing Example L Using the hash function h(x) = x % 7 insert the following values using quadratic probing: 7, 9,,, 7 insert(7) 7%7 = insert(9) 9%7 = insert() %7 = insert() %7 = insert(7) 7%7 = 9 9 9 9 7 7 7 7 7 probes: CPSC 9 Hashing Page 7

Example: h(x) = x mod Clicker question Hash these keys,,,, using quadratic probing. Which cell in the array does key hash to? A. B. C. D. 7 E. 9 7 8 9 CPSC 9 Hashing Page 8

Example: h(x) = x mod Clicker question (answer) Hash these keys,,,, using quadratic probing. Which cell in the array does key hash to? A. B. C. D. 7 E. 9 7 8 9 CPSC 9 Hashing Page 9

Problems with quadratic probing! Quadratic probing resolves to the primary clustering problem that exists in linear probing! One of the major drawbacks with quadratic probing is that a sequence of successive probes may only explore a fraction of the table, and this fraction may be quite small! we will not be able to find an empty location in the table despite the fact that the table is by no means full (See example from few pages ago).! This phenomenon is called secondary clustering: if there is a collision between two keys then the same probe sequence will be followed for both. CPSC 9 Hashing Page

Double Hashing f(i) = i hash (k) Probe sequence is h (k) mod size (h (k) + h (k)) mod size (h (k) + h (k)) mod size A good second hash function is quick to evaluate. differs from the original hash function. never evaluates to (mod size). CPSC 9 Hashing Page

Double Hashing Example Using the hash functions h (x) = x % 7 and h (x)= - (x % ) insert the following values using double hashing 7, 9,, 7,, insert(7) 7%7 = insert(9) 9%7 = insert() %7 = insert(7) insert() 7%7 = %7 = - (7%) = insert() %7 = - (%) = probes: 7 9 9 7 7 CPSC 9 Hashing Page 7 9 7 7 9 7 7 9 7

Clicker question The primary hash function is: h (k) = (k + ) mod. The secondary hash function is: h (k) = 7 (k mod 7) Hash these keys, in this order:,,, 88,, 9,. Which cell in the array does key hash to? A. B. C. D. E. 7 8 9 CPSC 9 Hashing Page

Clicker question (answer) h (k) = (k + ) mod. h (k) = 7 (k mod 7),,, 88,, 9,. Which cell in the array does key hash to? h() = (() + ) % = 7 A. B. C. D. E. h() = (() + ) % = h() = (() + ) % = 9 h(88) = ((88) + ) % = +7 88%7 = 8 h() = (() + ) % = 7 + 7 %7 = h(9) = ((9) + ) % = h() = (() + ) % = +( 7 %7) = 7 8 9 9 88 CPSC 9 Hashing Page

Deleting when using probing Example: Suppose locations [97] to [] are occupied in our hash table: [] [] [97] [98] [99] [] [] [] dog cat goat owl deer Suppose a new key hashes to [97]. Assuming a linear collision resolution policy, the key goes to. Later, suppose we delete the key that was hashed to [98]. Add a tombstone (i.e., flag, marker) for reasons:. If searching, keep going when you hit a tombstone.. If inserting, stop and the add item here. This means that table entries can be occupied, deleted, or free CPSC 9 Hashing Page

Open Addressing summarized Advantage of Open Addressing No memory is wasted on pointers. Performance of Hashing Performance depends on: the quality of the hash function (includes N) the collision resolution algorithm the available space in the hash table To estimate how full a table is, we calculate the load factor α: α = ( # entries in table ) / ( table size ) For open addressing, α. For chaining, α <. CPSC 9 Hashing Page

Learning Goals revisited Provide examples of the types of problems that can benefit from a hash data structure. Compare and contrast open addressing and chaining. Evaluate collision resolution policies. Describe the conditions under which hashing can degenerate from O() expected complexity to O(n). Identify the types of search problems that do not benefit from hashing (e.g., range searching) and explain why. Determine the characteristics of a good hash function (e.g., uniform distribution, onto function, appropriate table size). Determine how well a hashing solution scales. Explain how to solve scalability problems. CPSC 9 Hashing Page 7