Announcements. Container structures so far. IntSet ADT interface. Sets. Today's topic: Hashing (Ch. 10). Next topic: Graphs. Break around 11:45am


Announcements
  Today's topic: Hashing (Ch. 10)
  Next topic: Graphs
  Break around 11:45am

Container structures so far
  Array lists: O(1) access; O(n) insertion/deletion (average case), better at end
  Linked lists: O(n) access; O(n) insertion/deletion (average case), better at front and back
  Binary search trees: O(log n) access if balanced; O(log n) insertion/deletion if balanced
  Heaps: O(1) access of min/max; O(log n) insertion; O(log n) deletion (average case)
  Can we do even better?

Sets
  set: A collection of unique values (no duplicates allowed) that can perform the following operations efficiently: add, remove, search (contains).
  The client doesn't think of a set as having indices; we just add things to the set in general and don't worry about order.
  [Figure: a set holding "the", "if", "of", "to", "down", "from", "by", "she", "in", "you", "why", "him"]
    set.contains("to")   // true
    set.contains("be")   // false

IntSet ADT interface
  Let's think about how to write an implementation of a set. To simplify the problem, we only store ints in our set for now.
  As is (usually) done in the Java Collections Framework, we will define sets as an ADT by creating a Set interface.
  Core operations are: add, contains, remove.
    public interface IntSet {
        void add(int value);
        boolean contains(int value);
        void clear();
        boolean isEmpty();
        void remove(int value);
        int size();
    }

BST as a set
  We can implement a set as a binary search tree.
  O(log n) performance for: add, contains, remove.
  But there are other ways to implement a set, perhaps with better performance.
  Is there a way to use an array's fast, O(1), access?
  [Figure: binary search tree, root 55, holding -3, 29, 42, 55, 60, 87, 91]

Unfilled array set?
  Consider storing a set in an unfilled array. It doesn't really matter what order the elements appear in a set, so long as they can be added and searched quickly.
  What would make a good ordering for the elements?
  If we store them in the next available index, as in a list:
    set.add(9);
    set.add(23);
    set.add(8);
    set.add(-3);
    set.add(49);
    set.add(12);
    9 23 8 -3 49 12 0 0 0 0   size 6
  How efficient is add? contains? remove?
  O(1), O(n), O(n)

Sorted array set?
  Suppose we store the elements in an unfilled array, but in sorted order rather than order of insertion.
    set.add(9);
    set.add(23);
    set.add(8);
    set.add(-3);
    set.add(49);
    set.add(12);
    -3 8 9 12 23 49 0 0 0 0   size 6
  How efficient is add? contains? remove?
  O(n), O(log n), O(n)

A strange idea
  Silly idea: When the client adds value i, store it at index i in the array.
  Would this work? Problems/drawbacks of this approach? How to work around them?
    set.add(7);
    set.add(1);
    set.add(9);
    0 1 0 0 0 0 0 7 0 9   size 3   (indices 0 to 9)
    set.add(18);
    set.add(12);
    0 1 0 0 0 0 0 7 0 9 0 0 12 0 0 0 0 0 18 0   (indices 0 to 19)
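The sorted-array idea can be sketched as follows. This is a minimal illustration, not code from the lecture: the class name SortedArrayIntSet and the fixed capacity of 10 are assumptions. It relies on java.util.Arrays.binarySearch for the O(log n) search and on shifting elements for the O(n) add.

```java
import java.util.Arrays;

// Hypothetical sketch of a set kept in a sorted, partially filled array.
// Assumes at most 10 elements are ever added (no resizing).
class SortedArrayIntSet {
    private final int[] elements = new int[10];
    private int size = 0;

    // O(log n): binary search over the filled prefix [0, size).
    boolean contains(int value) {
        return Arrays.binarySearch(elements, 0, size, value) >= 0;
    }

    // O(n): find the insertion point, then shift larger elements right by one.
    void add(int value) {
        int pos = Arrays.binarySearch(elements, 0, size, value);
        if (pos >= 0) {
            return; // already present; sets hold no duplicates
        }
        int insertAt = -(pos + 1); // binarySearch encodes the insertion point this way
        System.arraycopy(elements, insertAt, elements, insertAt + 1, size - insertAt);
        elements[insertAt] = value;
        size++;
    }

    int size() {
        return size;
    }

    int get(int index) {
        return elements[index];
    }
}
```

Adding 9, 23, 8, -3, 49, 12 in that order leaves the array prefix sorted as -3, 8, 9, 12, 23, 49, matching the slide's picture.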

Hashing
  hash: To map a large domain of values to a smaller fixed domain. Typically, mapping a set of elements to integer indices in an array.
  Idea: Store any given element value in a particular predictable index. That way, adding/removing/looking for it are constant-time (O(1)).
  hash table: An array that stores elements via hashing.
  hash function: An algorithm that maps values to indices.
  hash code: The output of a hash function for a given value.
  In the previous slide, our "hash function" was: hash(i) -> i
    Potentially requires a large array.
    Doesn't work for negative numbers.
    Array could be very sparse, mostly empty (memory waste).

Hashing overview
  The hash function, hash(), maps a range of elements of arbitrary type into an integer range [0, M-1].
  [Figure: element -> hash() -> hash code (integer index) -> hash table (array, M buckets, indices 0, 1, 2, 3, ..., M-1)]

Improved hash function
  To deal with negative numbers: hash(i) -> abs(i)
  To deal with large numbers: hash(i) -> abs(i) % length
    set.add(37);   // abs(37) % 10 == 7
    set.add(-2);   // abs(-2) % 10 == 2
    set.add(49);   // abs(49) % 10 == 9
    0 0 -2 0 0 0 0 37 0 49   size 3

    // inside HashIntSet class
    private int hash(int i) {
        return Math.abs(i) % elements.length;
    }

Sketch of implementation
    public class HashIntSet implements IntSet {
        private int[] elements;   // the hash table

        public void add(int value) {
            elements[hash(value)] = value;
        }

        public boolean contains(int value) {
            return elements[hash(value)] == value;
        }

        public void remove(int value) {
            elements[hash(value)] = 0;
        }
    }
  Runtime of add, contains, and remove: O(1)
  Are there any problems with this approach?
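One way to see the problem with the sketch is to run it. The class below is a minimal stand-alone version (the name NaiveHashIntSet and the table length of 10 are assumptions made for this illustration); it stores each value directly at its hash index with no collision handling.

```java
// Hypothetical stand-alone version of the one-slot-per-hash-code sketch.
class NaiveHashIntSet {
    private final int[] elements = new int[10]; // the hash table; 0 marks an empty slot

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    void add(int value) {
        elements[hash(value)] = value; // silently overwrites whatever was stored there
    }

    boolean contains(int value) {
        return elements[hash(value)] == value;
    }

    public static void main(String[] args) {
        NaiveHashIntSet set = new NaiveHashIntSet();
        set.add(24);
        set.add(54); // 54 % 10 == 4, the same index as 24
        System.out.println(set.contains(54)); // true
        System.out.println(set.contains(24)); // false: 24 was overwritten
    }
}
```

Two values that share an index cannot both be stored, which is the collision problem the following slides address.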

Hash function
  In general, any function that maps from the space of elements to the space of array indices is a valid hash function, but a good hash function spreads the indices out over the entire hash table (array).
  A good hash function also tries to avoid collisions: multiple elements having the same index in the hash table.
  Uniform hashing assumption: Hashing is most efficient when index values are spread throughout the table.

Collisions
  collision: When the hash function maps 2 values to the same index.
    set.add(11);
    set.add(49);
    set.add(24);
    set.add(37);
    set.add(54);   // collides with 24!
    0 11 0 0 54 0 0 37 0 49
  collision resolution: An algorithm for fixing collisions.
    Probing
    Separate chaining
    etc.

Probing
  probing: Resolving a collision by moving to another index.
  linear probing: Moves to the next available index (wraps if needed).
    set.add(11);
    set.add(49);
    set.add(24);
    set.add(37);
    set.add(54);   // collides with 24; must probe
    0 11 0 0 24 54 0 37 0 49
  variation: quadratic probing moves increasingly far away: +1, +4, +9, ...

Implementing HashIntSet
  Let's implement an int set using a hash table with linear probing.
  For simplicity, assume that the set cannot store 0s for now.
    public class HashIntSet implements IntSet {
        private int[] elements;
        private int size;

        // constructs new empty set
        public HashIntSet() {
            elements = new int[10];
            size = 0;
        }

        // hash function maps values to indices
        private int hash(int value) {
            return Math.abs(value) % elements.length;
        }
    }
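To make the difference between linear and quadratic probing concrete, the sketch below lists the indices each strategy would examine from a given home index. The class and method names (ProbeSequences, linear, quadratic) are illustrative assumptions, and quadratic probing is taken here to mean offsets of k*k from the home index, consistent with the +1, +4, +9 pattern on the slide.

```java
import java.util.Arrays;

// Hypothetical helper that lists probe indices; not part of the lecture code.
class ProbeSequences {
    // Linear probing examines h, h+1, h+2, ... (wrapping around the table).
    static int[] linear(int home, int tableLength, int attempts) {
        int[] seq = new int[attempts];
        for (int k = 0; k < attempts; k++) {
            seq[k] = (home + k) % tableLength;
        }
        return seq;
    }

    // Quadratic probing examines h, h+1, h+4, h+9, ... (wrapping around the table).
    static int[] quadratic(int home, int tableLength, int attempts) {
        int[] seq = new int[attempts];
        for (int k = 0; k < attempts; k++) {
            seq[k] = (home + k * k) % tableLength;
        }
        return seq;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(linear(4, 10, 4)));    // [4, 5, 6, 7]
        System.out.println(Arrays.toString(quadratic(4, 10, 4))); // [4, 5, 8, 3]
    }
}
```

With a home index of 4 in a table of length 10, linear probing stays in the neighborhood of index 4, while quadratic probing jumps away from it sooner.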

The add operation
  How do we add an element to the hash table?
    Use the hash function to find the proper bucket index.
    If we see a 0, put it there, i.e., 0 means an available slot.
    If not, move forward until we find an empty (0) index to store it.
    If we see that the value is already in the table, don't re-add it.
    set.add(54);   // client code
    set.add();
    0 11 0 0 24 54 37 0 49   size 6

Implementing add
  How do we add an element to the hash table?
    public void add(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value) {   // Linear probing
            h = (h + 1) % elements.length;                   // for empty slot.
        }
        if (elements[h] != value) {   // Avoid duplicates.
            elements[h] = value;      // Add it here.
            size++;
        }
    }
    0 11 0 0 24 54 0 37 0 49

The contains operation
  How do we search for an element in the hash table?
    Use the hash function to find the proper bucket index.
    Loop forward until we either find the value, or an empty index (0).
    If we find the value, it is contained (true). If we find 0, it is not (false).
    We assume that the table is never full.
    set.contains(24)   // true
    set.contains()     // true
    set.contains(35)   // false
    0 11 0 0 24 54 37 0 49   size 6

Implementing contains
    public boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != 0) {
            if (elements[h] == value) {          // Linear probing
                return true;                     // to search
            }
            h = (h + 1) % elements.length;
        }
        return false;                            // not found
    }
    0 11 0 0 24 54 0 37 0 49

The remove operation
  We cannot remove by simply zeroing out an element:
    set.remove(54);   // set index 5 to 0
    set.contains()    // false??? oops
    0 11 0 0 24 0 34 0 49
  Instead, we replace it by a special "removed" placeholder value (can be re-used on add, but keep searching on contains).
    0 11 0 0 24 XX 34 0 49

Implementing remove
    public void remove(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value) {
            h = (h + 1) % elements.length;
        }
        if (elements[h] == value) {
            elements[h] = -999;   // "removed" flag value
            size--;
        }
    }
    set.remove(54);   // client code
    set.remove(11);
    set.remove(34);
    0 11 0 0 24 -999 34 0 49

Patching add, contains
    private static final int REMOVED = -999;

    // add needs patching.
    public void add(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value && elements[h] != REMOVED) {
            h = (h + 1) % elements.length;
        }
        if (elements[h] != value) {
            elements[h] = value;
            size++;
        }
    }

    // contains does not need patching;
    // it should keep going on a -999, which it already does
    public boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value) {
            h = (h + 1) % elements.length;
        }
        return elements[h] == value;
    }

Problem: full array
  clustering: Clumps of elements at neighboring indexes. Slows down the hash table lookup; you must loop through them.
    set.add(11);
    set.add(49);
    set.add(24);
    set.add(37);
    set.add(54);   // collides with 24
    set.add();     // collides with 24, then 54
    set.add(86);   // collides with, then 37
    0 0 0 0 0 0 0 0 0 0   size 0
  Where does each value go in the array?
  How many indices must be examined to answer contains(94)?
  What will happen if the array completely fills up?
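One subtlety with the "removed" placeholder: an add that stops at the first placeholder slot never looks further along the probe path, so it can insert a second copy of a value that already sits beyond that slot. The sketch below is one possible variant, not the lecture's code. The class name TombstoneIntSet and the firstRemoved variable are assumptions; it keeps the slides' conventions that 0 means empty and -999 means removed, and it assumes the table always keeps at least one empty slot.

```java
// Hypothetical variant of the linear-probing set with a duplicate-safe add.
// Assumes 0 and -999 are never stored and the table never runs out of empty slots.
class TombstoneIntSet {
    private static final int EMPTY = 0;
    private static final int REMOVED = -999;
    private final int[] elements = new int[10];
    private int size = 0;

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    void add(int value) {
        int h = hash(value);
        int firstRemoved = -1; // first reusable slot seen along the probe path
        while (elements[h] != EMPTY) {
            if (elements[h] == value) {
                return; // already present, possibly beyond a REMOVED slot
            }
            if (elements[h] == REMOVED && firstRemoved < 0) {
                firstRemoved = h;
            }
            h = (h + 1) % elements.length;
        }
        int target = (firstRemoved >= 0) ? firstRemoved : h;
        elements[target] = value;
        size++;
    }

    boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != EMPTY && elements[h] != value) {
            h = (h + 1) % elements.length; // keeps going past REMOVED slots
        }
        return elements[h] == value;
    }

    void remove(int value) {
        int h = hash(value);
        while (elements[h] != EMPTY && elements[h] != value) {
            h = (h + 1) % elements.length;
        }
        if (elements[h] == value) {
            elements[h] = REMOVED;
            size--;
        }
    }

    int size() {
        return size;
    }
}
```

For example, after adding 24, 54, 14 and removing 54, a second add(14) walks past the placeholder at index 5, finds 14 at index 6, and leaves the size at 2.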

Rehashing
  rehash: Using a larger array when the table is too full.
    Cannot simply copy the old array to a new one. (Why not?)
  load factor: ratio of (# of elements) / (hash table length)
    Many collections rehash when the load factor reaches about .75
    Before:  95 11 0 0 24 54 37 66 48   size 8
    After rehashing into a larger array:  0 0 0 0 24 0 66 0 48 0 0 11 0 0 54 95 37 0 0   size 8

Hash table sizes
  Can use prime numbers as hash table sizes to reduce collisions.
  Also improves spread / reduces clustering on rehash.
    set.add(11);    // 11 % 13 == 11
    set.add(39);    // 39 % 13 == 0
    set.add(21);    // 21 % 13 == 8
    set.add(29);    // 29 % 13 == 3
    set.add(71);    // 71 % 13 == 6
    set.add(41);    // 41 % 13 == 2
    set.add(101);   // 101 % 13 == 10
    39 0 41 29 0 0 71 0 21 0 101 11 0   size 7   (indices 0 to 12)
  Google: Why setting Hash Table length to a Prime Number is a good practice?

Iterator for a hash table
  How would you implement an iterator for a hash table using linear probing, e.g., HashIntSet?
  And also for one with separate chaining (next slide)?
  How would we implement toString on our HashIntSet?
    0 11 0 0 24 54 0 37 0 49
    System.out.println(set);   // [11, 24, 54, 37, 49]

Separate chaining
  separate chaining: Solving collisions by storing a list at each index.
    add/contains/remove must traverse lists, but the lists are short
    impossible to "run out" of indices, unlike with probing
  [Figure: array of buckets with linked-list chains holding 11, 24, 54, 7, 49]
    private class Node {
        public int data;
        public Node next;
    }
  Will see an alternative approach to implement chains later in MyHashMap.java
  Iterator for one with separate chaining?
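A rehash for the linear-probing set might look like the sketch below. It is an illustration under stated assumptions rather than the lecture's implementation: the class name RehashingIntSet, the 0.75 threshold check inside add, and the choice to double the array length (rather than pick the next prime) are all assumptions. The key point it shows is that every element is re-added, because its index depends on the table length.

```java
// Hypothetical linear-probing set that grows when the load factor would exceed 0.75.
// As in the slides, 0 marks an empty slot, so 0 itself cannot be stored.
class RehashingIntSet {
    private int[] elements = new int[10];
    private int size = 0;

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    void add(int value) {
        if (size + 1 > 0.75 * elements.length) {
            rehash();
        }
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value) {
            h = (h + 1) % elements.length;
        }
        if (elements[h] != value) {
            elements[h] = value;
            size++;
        }
    }

    boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != 0) {
            if (elements[h] == value) {
                return true;
            }
            h = (h + 1) % elements.length;
        }
        return false;
    }

    // Copying slot-for-slot would be wrong: value % 10 and value % 20 differ,
    // so each element is re-added and lands at its index for the new length.
    private void rehash() {
        int[] old = elements;
        elements = new int[2 * old.length];
        size = 0;
        for (int value : old) {
            if (value != 0) {
                add(value);
            }
        }
    }

    int capacity() {
        return elements.length;
    }

    int size() {
        return size;
    }
}
```

With this threshold, the eighth add into a table of length 10 triggers the rehash, after which 54 lives at index 14 instead of index 5.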

Implementing HashIntSet
  Let's implement a hash set of ints using separate chaining.
    public class HashIntSet implements IntSet {
        // array of linked lists;
        // elements[i] = front of list #i (null if empty)
        private Node[] elements;
        private int size;

        // constructs new empty set
        public HashIntSet() {
            elements = new Node[10];
            size = 0;
        }

        // hash function maps values to indexes
        private int hash(int value) {
            return Math.abs(value) % elements.length;
        }
    }

The add operation
  How do we add an element to the hash table?
    When you want to modify a linked list, you must either change the list's front reference, or the next field of a node in the list.
    Where in the list should we add the new element?
    Must make sure to avoid duplicates.
    set.add(24);
  [Figure: chained table holding 11, 54, 7, 49; a new node holding 24 is added]

Implementing add
    public void add(int value) {
        if (!contains(value)) {
            int h = hash(value);               // add to front
            Node newNode = new Node(value);    // of list #h
            newNode.next = elements[h];
            elements[h] = newNode;
            size++;
        }
    }

The contains operation
  How do we search for an element in the hash table?
    Must loop through the linked list for the appropriate hash index, looking for the desired value.
    set.contains()     // true
    set.contains(84)   // false
    set.contains(53)   // false
  [Figure: chained table holding 11, 24, 54, 7, 49]

Implementing contains
    public boolean contains(int value) {
        Node current = elements[hash(value)];
        while (current != null) {
            if (current.data == value) {
                return true;
            }
            current = current.next;
        }
        return false;
    }

The remove operation
  How do we remove an element from the hash table?
    Cases to consider: front (24), non-front (), not found (94), null (32)
    To remove a node from a linked list, you must either change the list's front reference, or the next field of the previous node in the list.
    set.remove(54);
  [Figure: chained table holding 11, 24, 54, 7, 49, with a current reference]

Implementing remove
    public void remove(int value) {
        int h = hash(value);
        if (elements[h] != null && elements[h].data == value) {
            elements[h] = elements[h].next;   // front case
            size--;
        } else {
            Node current = elements[h];       // non-front case
            while (current != null && current.next != null) {
                if (current.next.data == value) {
                    current.next = current.next.next;
                    size--;
                    return;
                }
                current = current.next;
            }
        }
    }

Rehashing with chaining
  Separate chaining handles rehashing similarly to linear probing.
    Loop over the list in each hash bucket; re-add each element.
  [Figure: chained table holding 11, 24, 54, 7, 49, before and after rehashing into a larger array with indices up to 19]
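The "loop over each bucket, re-add each element" step can be sketched as below. The class name ChainedIntSet, the 0.75 threshold, and the doubling growth policy are assumptions made for this illustration; the add and contains logic mirrors the slides.

```java
// Hypothetical separate-chaining set that rehashes when it gets too full.
class ChainedIntSet {
    private static class Node {
        final int data;
        Node next;

        Node(int data, Node next) {
            this.data = data;
            this.next = next;
        }
    }

    private Node[] elements = new Node[10]; // elements[i] = front of list #i (null if empty)
    private int size = 0;

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    boolean contains(int value) {
        for (Node cur = elements[hash(value)]; cur != null; cur = cur.next) {
            if (cur.data == value) {
                return true;
            }
        }
        return false;
    }

    void add(int value) {
        if (contains(value)) {
            return; // avoid duplicates
        }
        if (size + 1 > 0.75 * elements.length) {
            rehash();
        }
        int h = hash(value);
        elements[h] = new Node(value, elements[h]); // add to front of list #h
        size++;
    }

    // Walk every chain in the old table and re-insert each value at the index
    // computed for the new, larger table length.
    private void rehash() {
        Node[] old = elements;
        elements = new Node[2 * old.length];
        for (Node front : old) {
            for (Node cur = front; cur != null; cur = cur.next) {
                int h = hash(cur.data);
                elements[h] = new Node(cur.data, elements[h]);
            }
        }
    }

    int bucketCount() {
        return elements.length;
    }

    int size() {
        return size;
    }
}
```

Unlike probing, chaining never runs out of slots, so the rehash here is only a performance measure that keeps the chains short.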

Hash set of objects
    public class HashSet<E> implements Set<E> {
        private class Node {
            public E data;
            public Node next;
        }
    }
  It is easy to hash an integer i (use index abs(i) % length).
  How can we hash other types of values (such as objects)?

The hashCode method in Java
  All Java objects contain the following method (in Object):
    public int hashCode()
  Returns an integer hash code for this object.
  We can call hashCode on any object to find its preferred index.
  HashSet, HashMap, and the other built-in "hash" collections call hashCode internally on their elements to store the data.
  We can modify our set's hash function to be the following:
    private int hash(E e) {
        return Math.abs(e.hashCode()) % elements.length;
    }

Hash tables in Java
  Hashtable class
    stores key/value pairs
    does not allow null for either key or value
    older, slower class (thread-safe, synchronized)
  HashSet class
    implements Set interface; internal storage container is a hash table
    fast (unsynchronized)
    cf. TreeSet class, whose internal storage container is a Red-Black tree
  HashMap class
    implements Map interface; internal storage container for keys is a hash table
    allows null for key or value
    fast (unsynchronized)
    cf. TreeMap class

Maps
  Also known as: table, search table, dictionary, associative array, or associative container
  A data structure optimized for a very specific kind of search / access:
    with a bag we access by asking "is X present"
    with a list we access by asking "give me item number X"
    with a queue we access by asking "give me the item that has been in the collection the longest."
    In a map we access by asking "give me the value associated with this key."
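One edge case in Math.abs(e.hashCode()) % length is worth checking: Math.abs(Integer.MIN_VALUE) overflows and stays negative, so the resulting index can be negative. A common alternative is Math.floorMod, sketched below; the class name BucketIndex and the method indexFor are hypothetical names used only for this example.

```java
// Hypothetical helper showing a bucket-index computation that is never negative.
class BucketIndex {
    // Maps any object's hash code to an index in [0, tableLength).
    static int indexFor(Object e, int tableLength) {
        return Math.floorMod(e.hashCode(), tableLength);
    }

    public static void main(String[] args) {
        int min = Integer.MIN_VALUE; // an Integer's hash code is its own int value
        System.out.println(Math.abs(min) % 10); // -8: abs overflows, index would be negative
        System.out.println(indexFor(min, 10));  // 2
    }
}
```

For ordinary hash codes both formulas give a valid index, though not necessarily the same one for negative hash codes; what matters is that a table uses one formula consistently.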

Keys and values
  Dictionary analogy:
    The key in a dictionary is a word: foo
    The value in a dictionary is the definition: First on the standard list of metasyntactic variables used in syntax examples
  A key and its associated value form a pair that is stored in a map.
  To retrieve a value, the key for that value must be supplied.
    A List can be viewed as a Map with integer keys (indices).
  Keys must be unique, meaning a given key can only represent one value,
    but one value may be represented by multiple keys.

Implementing a HashMap
  A hash map is like a set where the nodes store key/value pairs:
    public class HashMap<K, V> implements Map<K, V> { ... }

    //       key      value
    map.put("Marty", );
    map.put("Jeff", 21);
    map.put("Kasey", 20);
    map.put("Stef", 35);
  [Figure: chained table with nodes "Stef" 35, "Marty", "Jeff" 21, "Kasey" 20]
  Must modify your Node class to store a key and a value.

Map ADT interface
  Let's think about how to write our own implementation of a map.
  As is (usually) done in the Java Collections Framework, we will define map as an ADT by creating a Map interface.
  Core operations: put (add), get, containsKey, remove
    public interface Map<K, V> {
        void clear();
        boolean containsKey(K key);
        V get(K key);
        boolean isEmpty();
        void put(K key, V value);
        void remove(K key);
        int size();
    }

HashMap vs. HashSet
  The hashing is always done on the keys, not the values.
  The contains method is now containsKey; and in remove, you search for a node whose key matches a given key.
  The add method is now put; if the given key is already there, you must replace its old value with the new one.
    map.put("Bill", 66);   // replace 49 with 66
  [Figure: chained table with nodes "Stef" 35, "Marty", "Abby" 57, "Bill" 49 (replaced by 66), "Jeff" 21, "Kasey" 20]
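The "replace the old value if the key is already there" rule for put can be sketched with a chained table whose nodes hold a key and a value. This is an assumption-laden illustration: the class name ChainedMap, the fixed table length of 10, and the lack of support for null keys are choices made for brevity, not details taken from MyHashMap.java.

```java
// Hypothetical chained hash map showing put-with-replace and get.
// Does not support null keys and never resizes.
class ChainedMap<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = (Node<K, V>[]) new Node[10];
    private int size = 0;

    // Hashing is done on the key only, never on the value.
    private int hash(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(K key, V value) {
        int h = hash(key);
        for (Node<K, V> cur = buckets[h]; cur != null; cur = cur.next) {
            if (cur.key.equals(key)) {
                cur.value = value; // key already present: replace its old value
                return;
            }
        }
        buckets[h] = new Node<>(key, value, buckets[h]); // new key: add to front of chain
        size++;
    }

    V get(K key) {
        for (Node<K, V> cur = buckets[hash(key)]; cur != null; cur = cur.next) {
            if (cur.key.equals(key)) {
                return cur.value;
            }
        }
        return null; // key not found
    }

    int size() {
        return size;
    }
}
```

Calling put("Bill", 49) and then put("Bill", 66) leaves a single entry whose value is 66, as in the slide's example.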

Java's TreeMap
  Uses a Red-Black tree to implement a Map
  relies on the compareTo method of the keys
  slower than HashMap
  keys stored in sorted order (cf. Are keys in HashMap in sorted order?)

Sample map problem
  Determine the frequency of words in a file.
    File f = new File(fileName);
    Scanner s = new Scanner(f);
    Map<String, Integer> counts = new HashMap<String, Integer>();
    while (s.hasNext()) {
        String word = s.next();
        if (!counts.containsKey(word)) {
            counts.put(word, 1);
        } else {
            counts.put(word, counts.get(word) + 1);
        }
    }

Implementing hashCode
  You can write your own hashCode methods in classes you write.
    All classes come with a default version, typically based on memory address.
    Your overridden version should somehow "add up" the object's state.
    Often you scale/multiply parts of the result to distribute the results.
    public class Point {
        private int x;
        private int y;

        public int hashCode() {
            // better than just returning (x + y);
            // spreads out numbers, fewer collisions
            return 137 * x + 23 * y;
        }
    }

Good hashCode behavior
  A well-written hashCode method should behave:
    Consistently with itself (must produce same results on each call):
      o.hashCode() == o.hashCode(), if o's state doesn't change
    Consistently with equality:
      a.equals(b) must imply a.hashCode() == b.hashCode()
      !a.equals(b) does NOT necessarily imply that a.hashCode() != b.hashCode() (why not?)
      When a class has an equals or hashCode, it should have both.
    Good distribution of hash codes:
      For a large set of objects with distinct states, they will generally return unique hash codes rather than all colliding into the same hash bucket.
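The equals/hashCode rules can be checked with a small experiment. The sketch below completes the slide's Point class with a constructor and an equals method; those additions are assumptions made for this illustration, while the hashCode formula is the one from the slide.

```java
import java.util.HashSet;
import java.util.Set;

// Point from the slide, completed with a constructor and equals (both added for this sketch).
class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof Point)) {
            return false;
        }
        Point other = (Point) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return 137 * x + 23 * y; // same formula as the slide
    }

    public static void main(String[] args) {
        Set<Point> points = new HashSet<>();
        points.add(new Point(3, 4));
        // Equal points have equal hash codes, so the lookup finds the right bucket.
        System.out.println(points.contains(new Point(3, 4))); // true

        // Unequal points may still collide: 137 * 23 + 23 * 0 == 137 * 0 + 23 * 137.
        Point a = new Point(23, 0);
        Point b = new Point(0, 137);
        System.out.println(a.hashCode() == b.hashCode()); // true
        System.out.println(a.equals(b));                  // false
    }
}
```

The second half of main is one answer to the slide's "why not?": distinct states can share a hash code, which is acceptable because equals still tells the objects apart within a bucket.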

Example: String hashCode
  The hashCode function inside the String class looks like this:
    public int hashCode() {
        int hash = 0;
        for (int i = 0; i < this.length(); i++) {
            hash = 31 * hash + this.charAt(i);
        }
        return hash;
    }
  As with any general hashing function, collisions are possible.
    Example: "Ea" and "FB" have the same hash.
  Early versions of Java examined at most 16 characters of the string.
    For some common data this led to poor hash table performance.

hashCode tricks
  If one of your object's fields is an object, call its hashCode:
    public int hashCode() {   // Student
        return 531 * firstName.hashCode() + ...;
    }
  To incorporate a double or boolean, use the hashCode method from the Double or Boolean wrapper classes:
    public int hashCode() {   // BankAccount
        return 37 * Double.valueOf(balance).hashCode()
                + Boolean.valueOf(isCheckingAccount).hashCode();
    }
  Guava includes an Objects.hashCode() method that takes any number of values and combines them into one hash code.
    public int hashCode() {   // BankAccount
        return Objects.hashCode(name, id, balance);
    }

Hash tables vs. BST vs. heaps on search
  BSTs: has complete ordering information
  Heaps: has incomplete ordering information
  Hash tables: has no order information

Example: using hash tables
  See UseHashSet.java, Student.java, StudentReader.java
  See UseHashMap.java
  See Hash.java
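The "Ea" / "FB" collision can be verified by hand-rolling the same polynomial hash. In the sketch below, the class name StringHashDemo and the method polyHash are hypothetical; the loop is the one shown on the slide, applied to an arbitrary string.

```java
// Hypothetical re-implementation of the polynomial string hash from the slide.
class StringHashDemo {
    static int polyHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        // 'E' is 69 and 'a' is 97:  31 * 69 + 97 == 2236
        // 'F' is 70 and 'B' is 66:  31 * 70 + 66 == 2236
        System.out.println(polyHash("Ea")); // 2236
        System.out.println(polyHash("FB")); // 2236
        System.out.println(polyHash("Ea") == "Ea".hashCode()); // true
    }
}
```

Moving one step up in the first character adds 31, and moving 31 steps down in the second character subtracts 31, which is why the two strings land on the same hash code.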

Example: implementing hash tables
  Using java.util.LinkedList as a chain in each bucket
  See MyHashSet.java
  See MyHashMap.java

Next topic
  Graphs