CSCD 326 Data Structures I Hashing


Hashing Background
Goal: provide a constant time complexity method of searching for stored data.
The best traditional searching time complexity available is $O(\log_2 n)$, for binary search, and binary search requires that the data be stored in sorted order.
Hashing approach to data storage and retrieval: contiguous memory is not used, and memory is sacrificed for speed.
Often used for symbol table management in compilers, assemblers, and linker/loaders.

Hashing - Basic Ideas
Data storage: hashing relies primarily on arrays for data storage, but not on contiguous storage within the array.
Data storage/retrieval method: use a mathematical function which, when given the key or data value to be stored, returns the array index at which to store the value. This is referred to as a "hash function." The same function is used to retrieve the value later on.

Simple Example of Hashing
Employee data is to be stored using employee number as a key. Employee numbers are unique and run from 10,000 to 19,999.
Storage: use an array of size 10,000.
Hash function: employee number - 10000. This provides a unique index into the array, and that array location is used to store/retrieve information for this employee.
Problem: in other situations, key values are often not unique or do not fall into a range which allows a reasonably sized array.
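The employee example can be sketched in a few lines of Python. This is only an illustration: the names `BASE`, `TABLE_SIZE`, `emp_hash`, `store`, and `fetch`, and the sample record, are assumptions and not part of the course material.

```python
# Direct-index table for the employee example (all names are illustrative).
BASE = 10_000          # smallest employee number
TABLE_SIZE = 10_000    # one slot per possible employee number

def emp_hash(emp_number: int) -> int:
    """Map an employee number in 10,000..19,999 to an index in 0..9,999."""
    if not BASE <= emp_number < BASE + TABLE_SIZE:
        raise ValueError("employee number out of range")
    return emp_number - BASE

table = [None] * TABLE_SIZE

def store(emp_number: int, record) -> None:
    table[emp_hash(emp_number)] = record

def fetch(emp_number: int):
    return table[emp_hash(emp_number)]

store(12_345, "Ada Example")
print(emp_hash(12_345))   # 2345
print(fetch(12_345))      # Ada Example
```

Because every key maps to its own slot, no collision handling is needed here; that only works because the key range is small and dense.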

Goals for Hashing Functions
The same key value (the value used for insertion) should always return the same index. If it does not, the data can't be retrieved later.
As much as possible, different key values should not hash to the same index. This is done by mixing things up with the hash function so that common patterns in key values do not hash to the same locations.
Collisions can never be prevented entirely, however, so collision handling becomes an issue.

Hash Function Construction Methods
Using numeric ASCII values of characters:
Example key: JUNK
Add the ASCII values of the characters (74 + 85 + 78 + 75) to produce a single integer (312). This may suffice, but the integer produced is not unique to "JUNK": any rearrangement of the same letters produces the same sum.
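As a quick check of the arithmetic, the character-sum method might look like this in Python. The function name `ascii_sum` is a placeholder chosen for this sketch.

```python
def ascii_sum(key: str) -> int:
    """Add the ASCII (code point) values of the characters in key."""
    return sum(ord(ch) for ch in key)

print(ascii_sum("JUNK"))  # 312
print(ascii_sum("KNUJ"))  # 312, same letters in a different order
```

The second call shows one way the result fails to be unique: addition ignores character order.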

Hash Function Construction Methods (2)
Concatenation of ASCII values: represent A - Z as the integers 0-25 and concatenate these values as 5-bit groups. So JUNK becomes:
J = 9, U = 20, N = 13, K = 10
01001 10100 01101 01010 = 315818
Since $2^{15} = 32768 = 32^3$, $2^{10} = 1024 = 32^2$, and $2^5 = 32 = 32^1$, the concatenation can be expressed as:
$9 \cdot 32^3 + 20 \cdot 32^2 + 13 \cdot 32^1 + 10 = 315818$
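The concatenation can be computed by repeatedly multiplying by 32 and adding the next letter value. A minimal sketch, assuming keys contain only the letters A-Z (the function name `concat_hash` is made up for illustration):

```python
def concat_hash(key: str) -> int:
    """Concatenate 5-bit letter codes (A=0 ... Z=25) into one integer."""
    value = 0
    for ch in key.upper():
        code = ord(ch) - ord("A")
        if not 0 <= code <= 25:
            raise ValueError(f"unsupported character: {ch!r}")
        value = value * 32 + code   # shift left 5 bits, then append the code
    return value

print(concat_hash("JUNK"))                  # 315818
print(format(concat_hash("JUNK"), "020b"))  # 01001101000110101010
```

Unlike the character sum, this value depends on letter order, but it grows quickly with key length, which is why a reduction step such as mod is usually applied afterwards.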

Hash Function Construction Methods (3)
Using the mod operator: allows reduction of large values into the range of actual hash table indices.
In the example above, if the table is an array of size 10000: 315818 % 10000 = 5818.
Note that here the mod operator simply removes the first two digits, which makes the hashed value less unique to the string used to generate it.

Hash Function Construction Methods (4)
Problems with use of the mod operator: the choice of exact table size is very important. If the table size shares common factors with many key values, many collisions can be generated.
e.g. table size 15, key values 10, 20, 30, 40, 50, 60, 70: here 7 values hash to three indices:
30, 60 hash to 0
20, 50 hash to 5
10, 40, 70 hash to 10
Solution: use an array size which is prime, so that it has no common factors with any key value other than multiples of the table size itself.
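The effect of table size can be seen by grouping the same keys under two different moduli. In this sketch the helper `buckets` and the prime size 17 are illustrative choices, not taken from the slides.

```python
from collections import defaultdict

def buckets(keys, table_size):
    """Group keys by the index key % table_size."""
    result = defaultdict(list)
    for key in keys:
        result[key % table_size].append(key)
    return dict(result)

keys = [10, 20, 30, 40, 50, 60, 70]
print(buckets(keys, 15))  # {10: [10, 40, 70], 5: [20, 50], 0: [30, 60]}
print(buckets(keys, 17))  # each key lands in its own index
```

With size 15 the keys share the factor 5 with the table size and pile into three indices; with the prime size 17 the same seven keys occupy seven different indices.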

Hash Function Construction Methods (5)
Using pseudo-random number generators: given the same starting seed, a pseudo-random number generator always produces the same sequence of values. Use a number generated from the key string as the seed, and use the first value of the resulting pseudo-random sequence to generate the hash table index.
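One possible rendering of this idea uses Python's standard `random.Random` class, which accepts an integer seed. The function name `prng_hash` and the choice of the character sum as the seed are assumptions made for this sketch.

```python
import random

def prng_hash(key: str, table_size: int) -> int:
    """Seed a generator from the key and use its first value as the index."""
    seed = sum(ord(ch) for ch in key)   # assumed key-to-integer step
    rng = random.Random(seed)           # private generator, global state untouched
    return rng.randrange(table_size)    # first value of the sequence

# The same key always seeds the same sequence, so the index is repeatable.
print(prng_hash("JUNK", 101) == prng_hash("JUNK", 101))  # True
```

Keys that produce the same seed still collide, so the quality of the seed computation matters as much as the generator.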

Hash Function Construction Methods (6)
Folding: scrambles numeric values to remove the effects of recurring patterns, e.g. by adding the numeric values.
Boundary folding: breaks numbers into segments and adds the segments. e.g. social security number 534-65-9234 breaks at the dashes, so the hash value is 534 + 65 + 9234.
Fan folding: like boundary folding, but reverses the digits in every other segment.
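Both folding variants, as described here, can be sketched for the social security number example. The function names are hypothetical, and the sketch assumes that "every other" segment means the second, fourth, and so on; the slides do not say which segments are reversed.

```python
def boundary_fold(key: str, sep: str = "-") -> int:
    """Split the key at the separators and add the segments."""
    return sum(int(segment) for segment in key.split(sep))

def fan_fold(key: str, sep: str = "-") -> int:
    """Like boundary_fold, but reverse the digits of every other segment.

    Assumption: the segments at odd positions (2nd, 4th, ...) are reversed.
    """
    total = 0
    for position, segment in enumerate(key.split(sep)):
        if position % 2 == 1:
            segment = segment[::-1]
        total += int(segment)
    return total

print(boundary_fold("534-65-9234"))  # 534 + 65 + 9234 = 9833
print(fan_fold("534-65-9234"))       # 534 + 56 + 9234 = 9824
```

The result would normally still be reduced with mod to fit the table size.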

Hash Function Construction Methods (7)
Digit or character extraction: another way to scramble similar patterns in multiple keys. It can be used in two ways:
1) Simply remove characters likely to be similar in many keys (or use only dissimilar characters).
2) Mid-square technique: represent the key as a number, square the number, and extract from the middle of the squared value enough bits to form an array index.
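The mid-square technique might be sketched as follows. The function name `mid_square`, the parameter `index_bits`, and the rule for locating the "middle" (an equal number of bits dropped from each end, rounding down) are assumptions for illustration.

```python
def mid_square(key: int, index_bits: int) -> int:
    """Square the key and return index_bits bits from the middle of the square."""
    squared = key * key
    total_bits = max(squared.bit_length(), index_bits)
    shift = (total_bits - index_bits) // 2    # low-order bits to discard
    mask = (1 << index_bits) - 1              # keeps exactly index_bits bits
    return (squared >> shift) & mask

# 12 * 12 = 144 = 0b10010000; the middle four bits are 0100.
print(mid_square(12, 4))  # 4
```

The extracted value lies in the range 0 to 2^index_bits - 1, so this form suits tables whose size is a power of two.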

Linked Collision Processing
The linked method of collision overflow handling divides memory into two parts:
One part for primary storage (the hash table itself).
A separate secondary part for collision overflow (may be either dynamically allocated or a separate fixed allocation area).

Linked Collision Processing (2)
Linked collision overflow handling: assume the hash table is an array of objects, each containing an instance variable which is a reference to an object of the same type.
On collision:
Dynamically allocate a new node and place the data into it.
Link the new node through the reference.
Overflowed items are thus stored in a linked list off the original table item.
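A minimal sketch of linked overflow handling follows. It simplifies the layout described above: each table slot holds a reference to the first node of a chain rather than a full object, and integer keys are hashed with `key % size`. The class and method names are illustrative only.

```python
class Node:
    """One stored item plus a reference to the next item in the same chain."""
    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node

class ChainedTable:
    def __init__(self, size=11):
        self.slots = [None] * size          # primary storage

    def _index(self, key):
        return key % len(self.slots)        # assumed hash for integer keys

    def put(self, key, value):
        i = self._index(key)
        node = self.slots[i]
        while node is not None:             # update in place if key exists
            if node.key == key:
                node.value = value
                return
            node = node.next
        # Allocate a new node and link it at the head of the chain.
        self.slots[i] = Node(key, value, self.slots[i])

    def get(self, key):
        node = self.slots[self._index(key)]
        while node is not None:             # sequential search of the chain
            if node.key == key:
                return node.value
            node = node.next
        raise KeyError(key)

t = ChainedTable()
t.put(3, "first")
t.put(14, "second")        # 14 % 11 == 3, so this collides with key 3
print(t.get(3), t.get(14)) # first second
```

The loop in `get` is the sequential search that makes lookups slower as chains grow.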

Linked Collision Processing (3)
[Figure: primary memory (the hash table) with linked overflow items in secondary memory (overflow).]

Linked Collision Processing (4)
Search time with linked overflow: if there have been many collisions, the search is no longer constant time, since a sequential search must be done through the linked list. The time complexity becomes $O(n)$, where n is the number of items that have collided at that index.

Linear Collision Processing
Also called linear probing. There is no separate primary and secondary memory; the original array holds both.
When a collision occurs:
Start at the hashed location (the site of the first collision).
Proceed sequentially through the array until available storage is found, and store at this location.
The array must be treated circularly, since a probe could reach the end and need to start again at the beginning.
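The probing steps above can be sketched as a small table class. This is an illustration under stated assumptions: integer keys hashed with `key % size`, `None` marking an empty slot, and no support for deletion. The class and method names are not from the slides.

```python
class LinearProbeTable:
    def __init__(self, size=11):
        self.keys = [None] * size      # None marks an empty slot
        self.values = [None] * size

    def _probe(self, key):
        """Yield indices from the hashed location onward, wrapping circularly."""
        size = len(self.keys)
        start = key % size             # assumed hash for integer keys
        for step in range(size):
            yield (start + step) % size

    def put(self, key, value):
        """Store the item and return the index where it was placed."""
        for i in self._probe(key):
            if self.keys[i] is None or self.keys[i] == key:
                self.keys[i], self.values[i] = key, value
                return i
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.keys[i] is None:   # empty slot: the key was never stored
                break
            if self.keys[i] == key:
                return self.values[i]
        raise KeyError(key)

t = LinearProbeTable(size=5)
print(t.put(4, "a"))  # 4
print(t.put(9, "b"))  # 0, since 9 % 5 == 4 is taken and the probe wraps around
print(t.get(9))       # b
```

The second insertion shows the circular treatment of the array: the probe runs off the end and continues from index 0.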

Linear Collision Processing
Problem with linear probing: clustering. If the hash function produces some values more often than others, parts of the table will quickly fill up while others remain empty. Clustering causes further collisions later.

Analysis of Linear Probing
Depends on the loading density of the hash table:
D = (number of records in hash table) / (size of hash table array); D = 1 indicates maximum density.
Average number of probes is approximately:
Successful search: $\frac{1}{2}\left(1 + \frac{1}{1-D}\right)$
Unsuccessful search: $\frac{1}{2}\left(1 + \frac{1}{(1-D)^2}\right)$

D     Successful   Unsuccessful
0.1   1.06         1.12
0.5   1.50         2.50
0.8   3.00         13.00
0.9   5.50         50.50

This is why linear probing is referred to as a density-dependent search technique.
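The two probe-count formulas can be evaluated directly for any loading density below 1. The function names in this short sketch are placeholders.

```python
def probes_successful(d: float) -> float:
    """Average probes for a successful search at loading density d < 1."""
    return 0.5 * (1 + 1 / (1 - d))

def probes_unsuccessful(d: float) -> float:
    """Average probes for an unsuccessful search at loading density d < 1."""
    return 0.5 * (1 + 1 / (1 - d) ** 2)

for d in (0.1, 0.5, 0.8, 0.9):
    print(f"D = {d:.1f}: {probes_successful(d):6.2f} {probes_unsuccessful(d):6.2f}")
```

Both expressions grow without bound as D approaches 1, which is the reason tables using linear probing are usually kept well below full.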

Rehashing
An alternative to linear probing that avoids clustering. After a collision occurs, apply a different hash function to get a new location altogether. If the new location is taken, either resort to linear probing from there or apply a 3rd or 4th hash function. Eventually some probing method must be used.

Quadratic Probing
Another alternative to linear probing. If a collision occurs at initial index k:
Try to store in index k + 1.
For all successive collisions (k + 1, etc.), try to store in index $k + r^2$, where r is a count of how many collisions have occurred.
Variation on rehashing: double hashing. Use the second hash function to determine a fixed increment with which to move through the array.
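The two probe sequences can be compared by listing the indices each one would try. In this sketch the function names are hypothetical, indices wrap around with mod, and the second hash function `1 + key % (table_size - 1)` is only one common choice, assumed here for illustration.

```python
def quadratic_probes(k: int, table_size: int, attempts: int) -> list:
    """Indices tried after a collision at k: k + r^2 for r = 1, 2, ..."""
    return [(k + r * r) % table_size for r in range(1, attempts + 1)]

def double_hash_probes(key: int, table_size: int, attempts: int) -> list:
    """Indices tried when a second hash supplies a fixed increment."""
    k = key % table_size                  # first hash: initial index
    step = 1 + key % (table_size - 1)     # assumed second hash, never zero
    return [(k + r * step) % table_size for r in range(1, attempts + 1)]

print(quadratic_probes(3, 11, 4))     # [4, 7, 1, 8]
print(double_hash_probes(25, 11, 3))  # [9, 4, 10]
```

With a prime table size, the fixed increment in double hashing eventually visits every index; with quadratic probing, two keys that collide at the same initial index still follow the same sequence.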