Hash table basics mod 83 ate. ate

Similar documents
Hash table basics. ate à. à à mod à 83

Hash table basics mod 83 ate. ate

Hash table basics mod 83 ate. ate. hashcode()

THINGS WE DID LAST TIME IN SECTION

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

HASH TABLES. Hash Tables Page 1

CS 310 Advanced Data Structures and Algorithms

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

Cpt S 223. School of EECS, WSU

Dynamic Dictionaries. Operations: create insert find remove max/ min write out in sorted order. Only defined for object classes that are Comparable

Hash Tables. Gunnar Gotshalks. Maps 1

Standard ADTs. Lecture 19 CS2110 Summer 2009

Lecture 18. Collision Resolution

Hash[ string key ] ==> integer value

HASH TABLES cs2420 Introduction to Algorithms and Data Structures Spring 2015

Priority Queues Heaps Heapsort

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

CSC 321: Data Structures. Fall 2016

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

BINARY HEAP cs2420 Introduction to Algorithms and Data Structures Spring 2015

Tables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Hashing Techniques. Material based on slides by George Bebis

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Priority Queue. 03/09/04 Lecture 17 1

Hash Table and Hashing

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015

Why do we need hashing?

TABLES AND HASHING. Chapter 13

CSC 321: Data Structures. Fall 2017

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada

Announcements. Hash Functions. Hash Functions 4/17/18 HASHING

CS1020 Data Structures and Algorithms I Lecture Note #15. Hashing. For efficient look-up in a table

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

MA/CSSE 473 Day 23. Binary (max) Heap Quick Review

Algorithms and Data Structures

Data Structures and Object-Oriented Design VIII. Spring 2014 Carola Wenk

Hash tables. hashing -- idea collision resolution. hash function Java hashcode() for HashMap and HashSet big-o time bounds applications

AAL 217: DATA STRUCTURES

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

CSE 214 Computer Science II Searching

Announcements. Container structures so far. IntSet ADT interface. Sets. Today s topic: Hashing (Ch. 10) Next topic: Graphs. Break around 11:45am

Tirgul 7. Hash Tables. In a hash table, we allocate an array of size m, which is much smaller than U (the set of keys).

COMP171. Hashing.

CPSC 259 admin notes

Hash Tables. Hashing Probing Separate Chaining Hash Function

Adapted By Manik Hosen

Abstract Data Types (ADTs) Queues & Priority Queues. Sets. Dictionaries. Stacks 6/15/2011

STANDARD ADTS Lecture 17 CS2110 Spring 2013

CS 3410 Ch 20 Hash Tables

Introduction to Hashing

CS 206 Introduction to Computer Science II

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

More on Hashing: Collisions. See Chapter 20 of the text.

Open Addressing: Linear Probing (cont.)

Announcements. Submit Prelim 2 conflicts by Thursday night A6 is due Nov 7 (tomorrow!)

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

CSE 143. Lecture 28: Hashing

Topic 22 Hash Tables

Announcements. Today s topic: Hashing (Ch. 10) Next topic: Graphs. Break around 11:45am

Hash Tables. Computer Science S-111 Harvard University David G. Sullivan, Ph.D. Data Dictionary Revisited

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

Fast Lookup: Hash tables

csci 210: Data Structures Maps and Hash Tables

HASH TABLES. Goal is to store elements k,v at index i = h k

UNIT III BALANCED SEARCH TREES AND INDEXING

9/24/ Hash functions

DATA STRUCTURES AND ALGORITHMS

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

CMSC 341 Hashing. Based on slides from previous iterations of this course

Lecture 16 More on Hashing Collision Resolution

CS2210 Data Structures and Algorithms

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

CSI33 Data Structures

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Lecture 4. Hashing Methods

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

CMSC 132: Object-Oriented Programming II. Hash Tables

CS 310: Hash Table Collision Resolution

Preview. A hash function is a function that:

Linked lists (6.5, 16)

Lecture 5 Data Structures (DAT037) Ramona Enache (with slides from Nick Smallbone)

CITS2200 Data Structures and Algorithms. Topic 15. Hash Tables

HASH TABLES. CSE 332 Data Abstractions: B Trees and Hash Tables Make a Complete Breakfast. An Ideal Hash Functions.

Review. CSE 143 Java. A Magical Strategy. Hash Function Example. Want to implement Sets of objects Want fast contains( ), add( )

Hashing as a Dictionary Implementation

CS 310 Hash Tables, Page 1. Hash Tables. CS 310 Hash Tables, Page 2

STRUKTUR DATA. By : Sri Rezeki Candra Nursari 2 SKS

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved.

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

Hash-Based Indexes. Chapter 11

Algorithms and Data Structures

Transcription:

Hash table basics After today, you should be able to explain how hash tables perform insertion in amortized O(1) time given enough space ate hashcode() 82 83 84 48594983 mod 83 ate

1. Section 2: 15+ min with CSSE Board of Advisors, looking for candid student feedback about the CSSE department 2. Then: 1. EditorTrees M1 discussion Unit test points and commits 2. HW6 discussion 3. Questions on HW6?

Efficiently putting 5 pounds of data in a 20 pound bag

Implementation choices: TreeSet (and TreeMap) uses a balanced tree: O(log n) Uses a red-black tree HashSet (and HashMap) uses a hash table: amortized O(1) time Related: maps allow insertion, retrieval, and deletion of items by key: Since keys are unique, they form a set. The values just go along for the ride. We ll focus on sets.

ate hashcode() 82 83 84 48594983 mod 83 ate 1. The underlying storage? Growable array 2. Calculate the index to store an item from the item itself. How? Hashcode. Fast but un-ordered. 3. What if that location is already occupied with another item? Collision. Two methods to resolve

1 Array of size m n elements with unique keys If n m, then use the key as an array index. Clearly O(1) lookup of keys Issues? Keys must be unique. Often the range of potential keys is much larger than the storage we want for an array Example: RHIT student IDs vs. # Rose students Diagram from John Morris, University of Western Australia

2 key hashcode() integer Objects that are.equals() MUST have the same hashcode values A good hashcode() also is fast to calculate and distributes the keys, like: hashcode( ate )= 48594983 hashcode( ape )= -76849201 (can be negative if overflows) hashcode( awe ) = 14893202

Example: if m = 100: hashcode( ate )= 48594983 hashcode( ape )= -76849201 hashcode( awe ) = 1489036 mod 83 46* 36 *Note: since the hashcode is an integer, it might be negative, and negative numbers have negative remainders. Trick: If it is negative, add Integer.MAX_VALUE to make it positive before you mod.

3-4 How Java s hashcode() is used: ate hashcode() 82 83 84 48594983 mod 83 ate Unless this position is already occupied a collision

5 Default if you inherit Object s: memory location Many JDK classes override hashcode() Integer: the value itself Double: XOR first 32 bits with last 32 bits String: we ll see shortly! Date, URL,... Custom classes should override hashcode() Use a combination of final fields. If key is based on mutable field, then the hashcode will change and you will lose it! People usually use strings if possible.

// This could be in the String class public static int hash(string s) { int total = 0; for (int i=0; i<s.length(); i++) total = total + s.charat(i); return total; } Advantages? Disadvantages?

// This could be in the String class public static int hash(string s) { int total = 0; for (int i=0; i<s.length(); i++) total = total*256 + s.charat(i); return total; } Spreads out the values more, and anagrams not an issue. What about overflow during computation? What happens to first characters?

6 // This could be in the String class public static int hash(string s) { int total = 0; for (int i=0; i<s.length(); i++) total = total*31 + s.charat(i); return total; } Spread out, anagrams OK, overflow OK. This is String s hashcode() method. The (x = 31x + y) pattern is a good one to follow.

7 ate hashcode() 82 83 84 48594983 mod 83 ate A good hashcode distributes keys evenly, but collisions will still happen hashcode() are ints only ~4 billion unique values. How many 16 character ASCII strings are possible? If n is small, tables should be much smaller mod will cause collisions too! Solutions: Chaining Probing (Linear, Quadratic)

8 Grow in another direction Examples:.get( at ),.get( him), (hashcode=18),.add( him ),.delete( with ) Java s HashMap uses chaining and a table size that is a power of 2.

9-10 m array slots, n items. Load factor, l=n/m. Runtime = O(l) Space-time trade-off 1. If m constant, then this is O(n). Why? 2. If keep m~0.5n (by doubling), then this is amortized O(1). Why?

No need to grow in second direction No memory required for pointers Historically, this was important! Still is for some data Will still need to keep load factor (l=n/m) low or else collisions degrade performance We ll grow the array again

11 Probe H (see if it causes a collision) Collision? Also probe the next available space: Try H, H+1, H+2, H+3, Wraparound at the end of the array Example on board:.add() and.get() Problem: Clustering Animation: http://www.cs.auckland.ac.nz/software/alganim/has h_tables.html

Figure 20.4 Linear probing hash table after each insertion Good example of clustering and wraparound Data Structures & Problem Solving using JAVA/2E Mark Allen Weiss 2002 Addison Wesley

For probing to work, 0 l 1. For a given l, what is the expected number of probes before an empty location is found?

Assume all locations are equally likely to be occupied, and equally likely to be the next one we look at. Then the probability that a given cell is full is l and probability that a given cell is empty is 1-l. What s the expected number? 12

For a proof, see Knuth, The Art of Computer Programming, Vol 3: Searching Sorting, 2nd ed, Addision-Wesley, Reading, MA, 1998. (1 st edition = 1968) 13 Clustering! Blocks of occupied cells are formed Any collision in a block makes the block bigger Two sources of collisions: Identical hash values Hash values that hit a cluster Actual average number of probes for large l:

Easy to implement Works well when load factor is low In practice, once l > 0.5, we usually double the size of the array and rehash This is more efficient than letting the load factor get high

Reminder: Linear probing: Collision at H? Try H, H+1, H+2, H+3,... New: Quadratic probing: Collision at H? Try H, H+1 2. H+2 2, H+3 2,... Eliminates primary clustering. Secondary clustering isn t as problematic

14 Choose a prime number for the array size, m Then if λ 0.5: Guaranteed insertion If there is a hole, we ll find it So no cell is probed twice Can show with m=17, H=6. For a proof, see Theorem 20.4: Suppose that we repeat a probe before trying more than half the slots in the table See that this leads to a contradiction Contradicts fact that the table size is prime

No one has been able to analyze it! Experimental data shows that it works well Provided that the array size is prime, and l < 0.5

15-17 Structure insert Find value Find max value Unsorted array Sorted array Balanced BST Hash table Finish the quiz. Then check your answers with the next slide

Structure insert Find value Find max value Unsorted array Amortized q(1) q(n) q(n) Sorted array q(n) q(log n) q(1) Balanced BST q(log n) q(log n) q(log n) Hash table Amortized q(1) q(1) q(n)

Why use 31 and not 256 as a base in the String hash function? Consider chaining, linear probing, and quadratic probing. What is the purpose of all of these? For which can the load factor go over 1? For which should the table size be prime to avoid probing the same cell twice? For which is the table size a power of 2? For which is clustering a major problem? For which must we grow the array and rehash every element when the load factor is high?

Constants matter! 727MB data, ~190M elements Many inserts, followed by many finds Microsoft's C++ STL Structure build (seconds) Size (MB) 100k finds (seconds) Hash map 22 6,150 24 Tree map 114 3,500 127 Sorted array 17 727 25 Why? Sorted arrays are nice if they don t have to be updated frequently! Trees still nice when interleaved insert/find