CS261 Data Structures. Maps (or Dictionaries)



Goals
- Introduce the Map (or Dictionary) ADT.
- Introduce an implementation of the map with a Dynamic Array.

So Far
- The emphasis has been on the values themselves, e.g. storing names in an AVL tree to quickly look up club members, or storing numbers in an AVL tree for a tree sort.
- Often, however, we want to associate something else (i.e. a value) with the lookup value (i.e. a key), e.g. a phonebook, dictionary, or student roster.

Map ADT (or dictionary, or associative array)
A Map stores not just values, but key-value pairs:

    void put(KT key, VT value)
    VT get(KT key)
    int containsKey(KT key)
    void removeKey(KT key)

    struct Association { KT key; VT value; };

- All comparisons are done on the key.
- All returned values are of type VT.
- Can be implemented with an AVLTree, HashTable, DynArr, etc.

Map or Dictionary Example Application: Tag Cloud = Concordance + Frequencies
- concordance (/kənˈkôrdns/), noun; plural: concordances. 1. An alphabetical list of the words (esp. the important ones) present in a text, usually with citations of the passages concerned.
- Keys: unique words from the text.
- Value: count of each word.

Dynamic Array Map: put

    void putDynArrayDictionary(struct dyArray *vec, KEYTYPE key, VALUETYPE val, comparator compareKey) {
        struct association *ap;
        /* If the key is already present, remove the old pair first. */
        if (containsKeyDynArrayDictionary(vec, key, compareKey))
            removeKeyDynArrayDictionary(vec, key, compareKey);
        ap = (struct association *) malloc(sizeof(struct association));
        assert(ap != 0);
        ap->key = key;
        ap->value = val;
        addDynArray(vec, ap);
    }

Dynamic Array Map: Contains

    int containsMap(DynArr *v, KT key, comparator compare) {
        int i = 0;
        for (i = 0; i < v->size; i++) {
            if ((*compare)(((struct association *)(v->data[i]))->key, key) == 0) /* found it */
                return 1;
        }
        return 0;
    }
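The put and contains routines above depend on the course's dynamic array library, which is not shown here. As a rough, self-contained sketch of the same idea, the version below assumes string keys and int values and grows a plain array by doubling; the names `Map`, `mapPut`, `mapGet`, `mapContainsKey`, and `mapRemoveKey` are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the course's KT/VT types and DynArr. */
struct Association { const char *key; int value; };

struct Map {
    struct Association *data; /* contiguous array of key-value pairs */
    int size;                 /* number of pairs stored */
    int capacity;             /* number of allocated slots */
};

void mapInit(struct Map *m) {
    m->size = 0;
    m->capacity = 4;
    m->data = malloc(m->capacity * sizeof(struct Association));
    assert(m->data != 0);
}

/* Linear search on the key; returns its index, or -1 if absent. */
static int mapFind(const struct Map *m, const char *key) {
    int i;
    for (i = 0; i < m->size; i++)
        if (strcmp(m->data[i].key, key) == 0)
            return i;
    return -1;
}

int mapContainsKey(const struct Map *m, const char *key) {
    return mapFind(m, key) >= 0;
}

/* Insert a new pair, or overwrite the value if the key already exists.
   The key pointer is stored as-is (not copied). */
void mapPut(struct Map *m, const char *key, int value) {
    int i = mapFind(m, key);
    if (i >= 0) {
        m->data[i].value = value;
        return;
    }
    if (m->size == m->capacity) { /* full: double the array */
        m->capacity *= 2;
        m->data = realloc(m->data, m->capacity * sizeof(struct Association));
        assert(m->data != 0);
    }
    m->data[m->size].key = key;
    m->data[m->size].value = value;
    m->size++;
}

/* Precondition: the key is present (check with mapContainsKey first). */
int mapGet(const struct Map *m, const char *key) {
    int i = mapFind(m, key);
    assert(i >= 0);
    return m->data[i].value;
}

void mapRemoveKey(struct Map *m, const char *key) {
    int i = mapFind(m, key);
    if (i < 0)
        return;
    m->data[i] = m->data[m->size - 1]; /* fill the hole; order is not preserved */
    m->size--;
}
```

Every operation here scans the array, so put, get, contains, and remove are all O(n); that cost is what motivates the hash table material that follows.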

CS261 Data Structures Hash Tables Concepts

Goals: Hashing concepts

Searching... Better than O(log n)?
- Skip lists and AVL trees reduce the time to perform operations (add, contains, remove) from O(n) to O(log n).
- Can we do better? Can we find a structure that will provide O(1) operations?
- Yes. No. Well, maybe...

Hash Tables
Hash tables are similar to arrays, except:
- Elements can be indexed by values other than integers. (Huh???)
- Multiple values may share an index. (What???)

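Hashing with a Hash Function
- The hash function maps a key (e.g. a string, URL, etc.) to an integer index.
- [Figure: keys pass through the hash function into a hash table with indices 0-4 holding Key y, Key w, Key z, and Key x.]
- Hash to an index for storage AND retrieval!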

Application Example: Spell Checker
- Know all your words beforehand.
- Need FAST lookups so you can highlight on the fly.
- Compute an integer index from the string.

[Figure: the string "Helo" hashes to index 3 of the table below.]

    idx  val
    0    hello
    1    pizza
    2    dog
    3    with
    4    front
    5    the
    6    well

Hashing to a Table Index
Computing a hash table index is a two-step process:
1. Transform the value (or key) to an integer (using the hash function).
2. Map that integer to a valid hash table index (using the mod operator).

Hash Function Goals
- FAST (constant time).
- Produce UNIFORMLY distributed indices.
- REPEATABLE (i.e. the same key always results in the same index).

FIXME: Something. Discussion of hash function magic here and how it's typically done!

Step 1: Transforming a key to an integer
- Mapping: map (a part of) the key into an integer. Example: a letter to its position in the alphabet.
- Folding: the key is partitioned into parts which are then combined using efficient operations (such as add, multiply, shift, XOR, etc.). Example: summing the values of each character in a string.

    Key   Mapped chars (position in alphabet)   Folded (+)
    eat   5 + 1 + 20                            26
    ate   1 + 20 + 5                            26
    tea   20 + 5 + 1                            26
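The mapping and folding steps can be sketched in a few lines of C. The function names `letterPosition` and `foldHash` are hypothetical, and treating non-letters as 0 is an assumption of this sketch rather than something the slides specify.

```c
#include <assert.h>
#include <ctype.h>

/* Mapping: a letter to its 1-based position in the alphabet (a = 1 ... z = 26).
   Non-letters map to 0; that choice is an assumption of this sketch. */
int letterPosition(char c) {
    if (!isalpha((unsigned char) c))
        return 0;
    return tolower((unsigned char) c) - 'a' + 1;
}

/* Folding by addition: sum the mapped value of every character. */
int foldHash(const char *key) {
    int sum = 0;
    for (; *key != '\0'; key++)
        sum += letterPosition(*key);
    return sum;
}
```

Because addition ignores order, `foldHash` returns 26 for each of "eat", "ate", and "tea", which is exactly the weakness the table illustrates.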

Step 1: Transforming a key to an integer (cont.)
- Shifting: can account for the position of characters.
- Each letter is shifted by its position in the word (counted right to left): the 0th letter is shifted left by 0, the first letter is shifted left by 1, etc.

    Key   Mapped chars (pos in alpha)   Folded (+)   Shifted and Folded
    eat   5 + 1 + 20                    26           20 + 2 + 20 = 42
    ate   1 + 20 + 5                    26           4 + 40 + 5 = 49
    tea   20 + 5 + 1                    26           80 + 10 + 1 = 91
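A minimal sketch of shift-and-fold follows. The name `shiftFoldHash` is hypothetical, and the code assumes keys made only of lowercase letters that are short enough for the shift to stay within the width of an `unsigned int`.

```c
#include <assert.h>
#include <string.h>

/* Shift-and-fold: the rightmost letter is shifted left by 0 bits, the next
   one by 1 bit, and so on; the shifted values are then added.
   Assumes lowercase a-z keys shorter than the bit width of unsigned int. */
unsigned int shiftFoldHash(const char *key) {
    size_t len = strlen(key);
    unsigned int sum = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        unsigned int pos = (unsigned int) (key[i] - 'a' + 1); /* a = 1 */
        unsigned int shift = (unsigned int) (len - 1 - i);    /* position from the right */
        sum += pos << shift;
    }
    return sum;
}
```

With this scheme the three anagrams no longer collide: "eat" gives 42, "ate" gives 49, and "tea" gives 91.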

Step 2: Mapping to a Valid Index
- Use the modulus operator (%) with the table size. Example: idx = hash(val) % size;
- Use only positive arithmetic or take the absolute value.
- To get a good distribution of indices, prime numbers make the best table sizes.
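In C, the % operator can return a negative result when the left operand is negative, which is why the slide asks for positive arithmetic. One way to handle this is sketched below; the helper name `hashToIndex` is hypothetical. It adjusts the remainder rather than taking the absolute value of the hash, which avoids overflow on the most negative int.

```c
#include <assert.h>

/* Map any integer hash value to a valid index in [0, size). */
int hashToIndex(int hashValue, int size) {
    int idx = hashValue % size; /* may be negative in C when hashValue < 0 */
    if (idx < 0)
        idx += size;
    return idx;
}
```

For example, with a table of size 7, a hash value of 26 maps to index 5 and a hash value of -26 maps to index 2.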

Hash Tables: Collisions
- A collision occurs when two values hash to the same index.
- How do we deal with this?

CS261 Data Structures Hash Tables Buckets/Chaining

Hash Tables: Resolving Collisions
There are two general approaches to resolving collisions:
1. Open address hashing: if a spot is full, probe for the next empty spot.
2. Chaining (or buckets): keep a collection at each table entry.

Resolving Collisions: Chaining
- Maintain a collection (typically a Map ADT) at each table entry.
- Each collection is called a chain (sometimes referred to as a bucket).

Hash Table Implementation: Initialization

    struct HashTable {
        struct LinkedList **table; /* Hash table -> array of lists. */
        int capacity;
        int count;
    };

    void initHashTable(struct HashTable *ht, int cap) {
        int i;
        ht->capacity = cap;
        ht->count = 0;
        ht->table = malloc(ht->capacity * sizeof(struct LinkedList *));
        assert(ht->table != 0);
        /* Start every bucket with an empty list. */
        for (i = 0; i < ht->capacity; i++)
            ht->table[i] = createLinkedList();
    }

Hash Table Implementation: Add

    void addHashTable(struct HashTable *ht, TYPE val) {
        /* Compute hash table index. */
        int idx = hash(val) % ht->capacity;
        if (idx < 0)
            idx += ht->capacity; /* keep the index non-negative */
        /* Add to the chain. */
        addList(ht->table[idx], val);
        ht->count++;
    }

Hash Table: Contains & Remove
- Contains: find the correct bucket using the hash function, then check whether the element is in the linked list.
- Remove: if the element is in the table (i.e. contains() returns true), remove it from the linked list and decrement the count.
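The following is a self-contained sketch of add, contains, and remove with chaining. It is not the course's implementation: it stores plain ints, uses the value itself as the hash, builds the chains from a hypothetical `struct Link` instead of the LinkedList type, and all function names are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct Link { int value; struct Link *next; };

struct ChainTable {
    struct Link **table; /* one chain head per bucket */
    int capacity;
    int count;
};

/* Bucket index for a value; the "hash" is simply the value itself here. */
static int bucketOf(const struct ChainTable *ht, int value) {
    int idx = value % ht->capacity;
    return idx < 0 ? idx + ht->capacity : idx;
}

void chainInit(struct ChainTable *ht, int cap) {
    int i;
    ht->capacity = cap;
    ht->count = 0;
    ht->table = malloc(cap * sizeof(struct Link *));
    assert(ht->table != 0);
    for (i = 0; i < cap; i++)
        ht->table[i] = 0; /* every chain starts empty */
}

void chainAdd(struct ChainTable *ht, int value) {
    int idx = bucketOf(ht, value);
    struct Link *lnk = malloc(sizeof(struct Link));
    assert(lnk != 0);
    lnk->value = value;
    lnk->next = ht->table[idx]; /* push onto the front of the chain */
    ht->table[idx] = lnk;
    ht->count++;
}

/* Find the bucket, then walk its chain. */
int chainContains(const struct ChainTable *ht, int value) {
    struct Link *cur = ht->table[bucketOf(ht, value)];
    for (; cur != 0; cur = cur->next)
        if (cur->value == value)
            return 1;
    return 0;
}

/* Remove one occurrence of value, if present, and decrement the count. */
void chainRemove(struct ChainTable *ht, int value) {
    struct Link **pp = &ht->table[bucketOf(ht, value)];
    while (*pp != 0) {
        if ((*pp)->value == value) {
            struct Link *dead = *pp;
            *pp = dead->next; /* unlink from the chain */
            free(dead);
            ht->count--;
            return;
        }
        pp = &(*pp)->next;
    }
}
```

With a capacity of 5, the values 3, 8, and 13 all land in bucket 3, so removing 8 exercises the unlinking of a node from the middle of a chain.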

Hash Table Size How large should your table be? How does it affect performance?

Hash Table Size
- Load factor: λ = n / m, where n is the number of elements (count) and m is the capacity of the table.
- The load factor represents the average number of elements in each bucket.
- For chaining, the load factor can be greater than 1.

Hash Table Load Factor
- Load factor: λ = n / m, where n is the number of elements (count) and m is the capacity of the table.
- The average number of links traversed in successful searches, S, and unsuccessful searches, U, is $S \approx \lambda / 2$ and $U \approx \lambda$.
- To maintain good performance: if the load factor becomes larger than some fixed limit (say, 8) → double the table size.

Hash Tables: Algorithmic Complexity
Assuming:
- Time to compute the hash function is constant.
- Chaining uses a linked list.
- Worst case analysis → all values hash to the same position.
- Best case analysis → the hash function uniformly distributes the values and there are no collisions.
Contains operation:
- Worst case for chaining → O(n)
- Best case for chaining → O(1)

Hash Tables With Chaining: Average Case (Unsuccessful)
- Assume the hash function distributes elements uniformly (a BIG if), and that we have collisions.
- Unsuccessful search: any key, K, is equally likely to hash to any of the m slots.
- The average time to search for a non-existent key is the average time to search to the end of a chain = λ.
- Total time: O(1 + λ) [time to compute the hash function + λ].

Hash Tables With Chaining: Average Case (Successful)
- Assume the hash function distributes elements uniformly (a BIG if), and that we have collisions.
- Successful search: any key, K, is equally likely to be any of the n keys stored in the table.
- Assume values are added to the end of a chain.
- The expected number of elements examined upon success is 1 more than the number of elements examined when the sought-for element was inserted.
- Therefore, take the average, over the n items in the table, of 1 + the expected length of the list to which the i-th element was added:

$$\frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right)$$

Here the factor 1/n averages over the n items in the table, the 1 is the "one more" element examined, and (i-1)/m is the expected length of the list to which the i-th element was added.

Hash Tables With Chaining: Average Case (Successful), cont.

$$\frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right) = 1 + \frac{1}{nm}\sum_{i=1}^{n}(i-1) = 1 + \frac{1}{nm}\cdot\frac{n(n-1)}{2} = 1 + \frac{n-1}{2m} = 1 + \frac{\lambda}{2} - \frac{1}{2m}$$

- Time for a successful search, including time for hashing: O(1 + λ/2 - 1/(2m)) = O(1 + λ).
- So, if the number of slots in the table, m, is at least proportional to the number of elements in the table, n, then n = O(m) and consequently λ = n/m = O(m)/m = O(1).

Hash Tables With Chaining: Average Case
- So, we want to keep the load factor relatively small.
- We monitor (i.e. check) the load factor.
- Resize the table (doubling its size) if the load factor is larger than some fixed limit (e.g., 8).
- This only improves things IF the hash function distributes values uniformly.
- How do we handle a resize?
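One plausible answer to the resize question is sketched below: allocate a new bucket array of twice the capacity, recompute each element's index with the new capacity, and relink it into the new array. This is an illustrative sketch only. It stores unsigned ints, uses the value itself as the hash, checks the load factor before each insertion, and all names (`ChainTable`, `chainResize`, `MAX_LOAD`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LOAD 8.0 /* example limit from the slide */

struct Link { unsigned int value; struct Link *next; };
struct ChainTable { struct Link **table; int capacity; int count; };

/* Allocate an array of cap empty chains. */
static struct Link **newBuckets(int cap) {
    int i;
    struct Link **t = malloc(cap * sizeof(struct Link *));
    assert(t != 0);
    for (i = 0; i < cap; i++)
        t[i] = 0;
    return t;
}

void chainInit(struct ChainTable *ht, int cap) {
    ht->table = newBuckets(cap);
    ht->capacity = cap;
    ht->count = 0;
}

/* Double the capacity and relink every element into its new bucket. */
static void chainResize(struct ChainTable *ht) {
    int i, newCap = 2 * ht->capacity;
    struct Link **newTable = newBuckets(newCap);
    for (i = 0; i < ht->capacity; i++) {
        struct Link *cur = ht->table[i];
        while (cur != 0) {
            struct Link *next = cur->next;
            int idx = (int) (cur->value % (unsigned int) newCap); /* rehash */
            cur->next = newTable[idx];
            newTable[idx] = cur;
            cur = next;
        }
    }
    free(ht->table); /* the links were moved; only the old array is freed */
    ht->table = newTable;
    ht->capacity = newCap;
}

void chainAdd(struct ChainTable *ht, unsigned int value) {
    int idx;
    struct Link *lnk;
    /* Resize first if this insertion would push the load factor past the limit. */
    if ((double) (ht->count + 1) / ht->capacity > MAX_LOAD)
        chainResize(ht);
    idx = (int) (value % (unsigned int) ht->capacity);
    lnk = malloc(sizeof(struct Link));
    assert(lnk != 0);
    lnk->value = value;
    lnk->next = ht->table[idx];
    ht->table[idx] = lnk;
    ht->count++;
}

int chainContains(const struct ChainTable *ht, unsigned int value) {
    struct Link *cur = ht->table[value % (unsigned int) ht->capacity];
    for (; cur != 0; cur = cur->next)
        if (cur->value == value)
            return 1;
    return 0;
}
```

The key point is that elements cannot simply be copied bucket by bucket: the index depends on the capacity, so every element has to be rehashed against the new table size.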

Design Decisions
- Implement the Map interface to store values with keys (i.e. implement a dictionary).
- Rather than store linked lists, build the linked lists directly: Link **hashTable;

Next Assignment: Hash Map
- Concordance + Frequencies.
- Use a hash table to implement a Map.
- Look up a key [word], return the value [frequency count] (or return a pointer so you can update it!).
- [Figure: "help" → 3; a table with indices 0-6 whose entries include (run, 7), (help, 2), and (ride, 4).]
- The entry (help, 2) means that "help" has occurred 2 times in this selection of text!

Your Turn Worksheet 38: Hash Tables using Buckets

Perfect Hash Function
- A perfect hash function is one where there are no collisions.
- Minimally perfect: no collisions AND table size = number of elements.
- Only possible when you know all your keys in advance of building the table.

Minimally Perfect Hash Function
Hash: position of the 3rd letter in the alphabet (starting at the left, index 0), mod 6.

    Name      3rd letter   Index
    Alfred    f = 5        5 % 6 = 5
    Alessia   e = 4        4 % 6 = 4
    Amina     i = 8        8 % 6 = 2
    Amy       y = 24       24 % 6 = 0
    Andy      d = 3        3 % 6 = 3
    Anne      n = 13       13 % 6 = 1

Resulting table: 0: Amy, 1: Anne, 2: Amina, 3: Andy, 4: Alessia, 5: Alfred
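The claim that this function is minimally perfect for these six names can be checked mechanically. The sketch below uses hypothetical names (`thirdLetterHash`, `isMinimallyPerfect`) and assumes each name has at least three characters with a lowercase third letter.

```c
#include <assert.h>

#define NUM_SLOTS 6

/* 0-based alphabet position of the third letter, mod 6.
   Assumes at least three characters and a lowercase third letter. */
int thirdLetterHash(const char *name) {
    return (name[2] - 'a') % NUM_SLOTS;
}

/* Returns 1 if the n names fill all six slots with no collisions. */
int isMinimallyPerfect(const char *names[], int n) {
    int used[NUM_SLOTS] = {0};
    int i;
    if (n != NUM_SLOTS)
        return 0; /* table size must equal the number of keys */
    for (i = 0; i < n; i++) {
        int idx = thirdLetterHash(names[i]);
        if (used[idx])
            return 0; /* collision */
        used[idx] = 1;
    }
    return 1;
}
```

Swapping any one name for a different key, such as "Aimee" (third letter m, 12 % 6 = 0, which collides with Amy), breaks the property, which is why the key set must be known in advance.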

CS261 Data Structures Hash Tables Open Address Hashing

Goals: Open address hashing

Hash Tables: Resolving Collisions
There are two general approaches to resolving collisions:
1. Open address hashing: if a spot is full, probe for the next empty spot.
2. Chaining (or buckets): keep a collection at each table entry.

Open Address Hashing
- All values are stored in an array.
- The hash value is used to find the initial index.
- If that position is filled, the next position is examined, then the next, and so on until you find the element OR an empty position is found.
- The process of looking for an empty position is termed probing, specifically linear probing when we look to the next element.

Open Address Hashing: Example
Eight-element table using Amy's hash function (alphabet position of the 3rd letter of the name - 1).
Already added: Amina, Andy, Alessia, Alfred, and Aspen.

    idx  letters  name
    0    a i q y  Amina
    1    b j r z
    2    c k s
    3    d l t    Andy
    4    e m u    Alessia
    5    f n v    Alfred
    6    g o w
    7    h p x    Aspen

Note: we've shown where each letter of the alphabet maps to (given a table size of 8) so you don't have to calculate it! E.g. Y is the 25th letter (we use a 0 index, so the integer value is 24) and 24 mod 8 is 0.


Open Address Hashing: Adding
- Now we need to add: Aimee.
- Aimee hashes to index 4 (3rd letter m). The hashed index position (4) is filled by Alessia, so we probe to find the next free location; Aimee is placed at index 6.
- Table: 0: Amina, 1: (empty), 2: (empty), 3: Andy, 4: Alessia, 5: Alfred, 6: Aimee, 7: Aspen

Open Address Hashing: Adding (cont.)
- Suppose Anne wants to join. Anne hashes to index 5 (3rd letter n).
- The hashed index position (5) is filled by Alfred: probe to find the next free location → what happens when we reach the end of the array?
- Table: 0: Amina, 1: (empty), 2: (empty), 3: Andy, 4: Alessia, 5: Alfred, 6: Aimee, 7: Aspen

Open Address Hashing: Adding (cont.)
- Suppose Anne wants to join. Anne hashes to index 5, which is filled by Alfred: probe to find the next free location.
- When we get to the end of the array, wrap around to the beginning.
- Eventually, we find the position at index 1 open, and Anne is placed there.
- Table: 0: Amina, 1: Anne, 2: (empty), 3: Andy, 4: Alessia, 5: Alfred, 6: Aimee, 7: Aspen

Open Address Hashing: Contains
- Hash to find the initial index.
- Probe forward until the value is found (return 1) OR an empty location is found (return 0).
- Table: 0: Amina, 1: (empty), 2: (empty), 3: Andy, 4: Alessia, 5: Alfred, 6: Aimee, 7: Aspen
- Notice that search time is not uniform.
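The add and contains walkthroughs can be reproduced with a short sketch of linear probing. It assumes the eight-slot table and "Amy's hash function" from the example, stores pointers to the names (null marks an empty slot), and uses hypothetical names such as `OpenTable`, `openAdd`, and `openContains`. Names are assumed to have a lowercase third letter.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 8

struct OpenTable {
    const char *slots[TABLE_SIZE]; /* 0 (null) marks an empty slot */
    int count;
};

/* "Amy's hash function": 0-based alphabet position of the third letter. */
static int amyHash(const char *name) {
    return name[2] - 'a';
}

void openInit(struct OpenTable *t) {
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        t->slots[i] = 0;
    t->count = 0;
}

/* Returns the index where the name was stored. The table must not be full. */
int openAdd(struct OpenTable *t, const char *name) {
    int idx = amyHash(name) % TABLE_SIZE;
    assert(t->count < TABLE_SIZE);
    while (t->slots[idx] != 0)
        idx = (idx + 1) % TABLE_SIZE; /* linear probe with wrap-around */
    t->slots[idx] = name;
    t->count++;
    return idx;
}

/* Probe forward until the name is found (1) or an empty slot is hit (0). */
int openContains(const struct OpenTable *t, const char *name) {
    int idx = amyHash(name) % TABLE_SIZE;
    int probes;
    for (probes = 0; probes < TABLE_SIZE; probes++) {
        if (t->slots[idx] == 0)
            return 0;
        if (strcmp(t->slots[idx], name) == 0)
            return 1;
        idx = (idx + 1) % TABLE_SIZE;
    }
    return 0; /* scanned a completely full table without finding it */
}
```

Adding Amina, Andy, Alessia, Alfred, and Aspen places them at indices 0, 3, 4, 5, and 7; Aimee then probes from 4 to 6, and Anne probes from 5, wraps around, and lands at index 1, matching the slides.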

Open Address Hashing: Remove
- Remove is tricky. What happens if we delete Anne, then search for Alan?
- Before the removal, the table holds (in index order): Amina, Anne, Alan, Alessia, Alfred, Aimee, Aspen.
- Remove: Anne. The table now holds: Amina, (empty), Alan, Alessia, Alfred, Aimee, Aspen.
- Find: Alan. Alan hashes to index 0 (Amina); probing finds the null entry left by Anne → Alan is not found, even though Alan is still in the table.

Open Address Hashing: Handling Remove
- Simple solution: don't allow removal (e.g. words don't get removed from a spell checker!).
- Alternative solution: replace the removed item with a tombstone:
  - a special value that marks a deleted entry;
  - can be replaced when adding a new entry;
  - but doesn't halt the search during contains or remove.
- Find: Alan. Alan hashes to index 0; probing skips the tombstone → Alan is found.
- Table (in index order): Amina, _TS_, Alan, Alessia, Alfred, Aimee, Aspen.
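A tombstone scheme along these lines might look like the sketch below. It is an assumption-laden illustration: the tombstone is a private sentinel string compared by address, the table has the eight slots of the running example, duplicates are not checked on add, and every name (`OpenTable`, `openRemove`, `TOMBSTONE`, and so on) is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 8

/* Sentinel for deleted entries; compared by address, never by content. */
static const char TOMBSTONE_MARK[] = "_TS_";
#define TOMBSTONE (TOMBSTONE_MARK)

struct OpenTable { const char *slots[TABLE_SIZE]; int count; };

/* "Amy's hash function": third letter's 0-based alphabet position, mod 8. */
static int amyHash(const char *name) { return (name[2] - 'a') % TABLE_SIZE; }

void openInit(struct OpenTable *t) {
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        t->slots[i] = 0; /* null marks a never-used slot */
    t->count = 0;
}

/* Index of name, or -1. Tombstones are skipped; only a null slot stops the search. */
static int openFind(const struct OpenTable *t, const char *name) {
    int idx = amyHash(name), probes;
    for (probes = 0; probes < TABLE_SIZE; probes++) {
        const char *s = t->slots[idx];
        if (s == 0)
            return -1;
        if (s != TOMBSTONE && strcmp(s, name) == 0)
            return idx;
        idx = (idx + 1) % TABLE_SIZE;
    }
    return -1;
}

int openContains(const struct OpenTable *t, const char *name) {
    return openFind(t, name) >= 0;
}

/* Stores name in the first empty or tombstoned slot on its probe path.
   Returns the index used, or -1 if the table is full. */
int openAdd(struct OpenTable *t, const char *name) {
    int idx = amyHash(name), probes;
    for (probes = 0; probes < TABLE_SIZE; probes++) {
        if (t->slots[idx] == 0 || t->slots[idx] == TOMBSTONE) {
            t->slots[idx] = name;
            t->count++;
            return idx;
        }
        idx = (idx + 1) % TABLE_SIZE;
    }
    return -1;
}

/* Replace the entry with a tombstone instead of emptying the slot. */
void openRemove(struct OpenTable *t, const char *name) {
    int idx = openFind(t, name);
    if (idx >= 0) {
        t->slots[idx] = TOMBSTONE;
        t->count--;
    }
}
```

After removing Anne from index 1, a search for Alan (hashing to 0 and stored at 2) steps over the tombstone and still succeeds, and a later add whose probe path reaches index 1 can reuse that slot.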

Hash Table Size: Load Factor
- Load factor: λ = n / m, where n is the number of elements (count) and m is the capacity of the table.
- The load factor represents the portion of the table that is filled.
- For open address hashing, the load factor is between 0 and 1 (often somewhere between 0.5 and 0.75).
- We want the load factor to remain small in order to avoid collisions: the space/speed tradeoff again!

Hash Tables: Algorithmic Complexity
Assumptions:
- Time to compute the hash function is constant.
- Worst case analysis → all values hash to the same position.
- Best case analysis → the hash function uniformly distributes the values and there are no collisions.
Find element operation:
- Worst case for open addressing → O(n)
- Best case for open addressing → O(1)

Hash Tables: Average Case
What about the average case for successful searches, S, and unsuccessful searches, U?

$$S = \frac{1}{\lambda}\ln\frac{1}{1-\lambda} + \frac{1}{\lambda}, \qquad U = \frac{1}{1-\lambda}$$

    λ     S        U
    0.5   ~3.386   2
    0.9   ~3.7     10

If λ is constant, the average case is O(1), but we want to keep λ small.

Clustering
- Assuming a uniform distribution of hash values, what's the probability that the next value will end up in index 6? In index 2? In index 1?
- Table: 0: Amina, 1: (empty), 2: (empty), 3: Andy, 4: Alessia, 5: Alfred, 6: (empty), 7: Aspen
- Index 1: 3/8 (hash values 7, 0, and 1 all end up there). Index 2: 1/8. Index 6: 4/8 (hash values 3, 4, 5, and 6 all end up there).
- As the load factor gets larger, the tendency to cluster increases, resulting in longer search times upon collision.

Double Hashing
- Rather than use a linear probe (i.e. looking at successive locations), use a second hash function to determine the probe step.
- Helps to reduce clustering.
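A minimal sketch of double hashing follows. The slides do not give a specific second hash function, so the choice here, a step of 1 + (key mod (m - 1)) with a prime table size m = 7, is an assumption (it is a common textbook choice that keeps the step non-zero and lets every probe sequence visit all slots). Keys are assumed to be non-negative ints, and all names are hypothetical.

```c
#include <assert.h>

#define TABLE_SIZE 7 /* prime, so every step in 1..6 visits all slots */
#define EMPTY (-1)   /* marks an unused slot; keys must be non-negative */

/* Primary hash: the home index. */
static int hash1(int key) { return key % TABLE_SIZE; }

/* Secondary hash: the probe step, always in 1..TABLE_SIZE-1 (never 0). */
static int hash2(int key) { return 1 + (key % (TABLE_SIZE - 1)); }

void dhInit(int table[TABLE_SIZE]) {
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Returns the index where key was stored, or -1 if the table is full. */
int dhAdd(int table[TABLE_SIZE], int key) {
    int idx = hash1(key), step = hash2(key), probes;
    for (probes = 0; probes < TABLE_SIZE; probes++) {
        if (table[idx] == EMPTY) {
            table[idx] = key;
            return idx;
        }
        idx = (idx + step) % TABLE_SIZE; /* jump by the key-specific step */
    }
    return -1;
}

int dhContains(const int table[TABLE_SIZE], int key) {
    int idx = hash1(key), step = hash2(key), probes;
    for (probes = 0; probes < TABLE_SIZE; probes++) {
        if (table[idx] == EMPTY)
            return 0;
        if (table[idx] == key)
            return 1;
        idx = (idx + step) % TABLE_SIZE;
    }
    return 0;
}
```

The keys 10, 17, and 24 all have home index 3, but their steps differ (5, 6, and 1), so after the collision they scatter to indices 2 and 4 instead of piling up in consecutive slots as they would under linear probing.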

Large Load Factor: What to do?
Common solution: when the load factor becomes too large (say, bigger than 0.75) → reorganize:
- Create a new table with twice the number of positions.
- Copy each element, rehashing using the new table size, and place the elements in the new table.
- Delete the old table.

Hashing in Practice
- Need to find a good hash function → one that uniformly distributes keys to all indices.
- Open address hashing: need to tell whether a position is empty or not. One solution → store only pointers and check for null (== 0).

Tradeoffs
- Chaining takes more memory (pointers).
- Chaining is often preferred when you have to deal with deletions.

Your Turn Complete Worksheet #37: Open Address Hashing