
Data Structures Dr Ahmed Rafat Abas Computer Science Dept, Faculty of Computer and Information, Zagazig University arabas@zu.edu.eg http://www.arsaliem.faculty.zu.edu.eg/

TABLES AND HASHING Chapter 13

13.1 Alternative methods of storing data
Hashing is a technique of storing data so that the amount of work required to retrieve a particular item is independent of the length of the list.

A familiar example of this idea is the way arrays are stored and used. The array data type stores data at a location given by the array index. The location at which element i of an array is stored is calculated by starting at the base address of the array and adding to it the size of each element multiplied by i. This base-plus-offset method means that the time required to locate any array element is constant, independent of its location in the array and of the size of the array.

13.2 The table data structure
The idea behind storing array elements can be generalized to allow any data to be stored in a one-dimensional form. Example: suppose we wish to count the number of times each word occurs in a file, producing a table of word-frequency pairs. We could define an array where each element stores the count for a particular word. In doing so, we are faced with two problems: The character string forming a word is not an integer and, in C++, cannot be used as an array index. There are more than 400,000 words in the English language, only a small fraction of which will be used in any file of moderate length. To cope with these problems, a new data structure called a table is defined.

Tables are similar to arrays, but the word array refers to the actual data structure found in the C++ language. A table consists of a function or formula which maps members of one data type D (for example, the words in the word-counting problem) onto another type, called the index I (usually non-negative integers), which is used to store and access the original data.

Properties of a table data type: An index function, which calculates the value of the index I given the data D. (In the word-counting problem, such a function would calculate the array index at which a particular word is stored.) Table insertion: a new data item (for example, a word) may be inserted into the table. Table retrieval: the table may be searched for a data item and, if present, the item and its associated values may be retrieved. (Given a word, the table is searched to see if that word is present and, if so, the count of the number of times it has been used is retrieved.) Table deletion: a data item may be deleted from the table.

Provided the function which converts the original data into the table index I is efficient, a table can represent a considerable increase in efficiency over the searching routines studied earlier. The location of the data you are searching for (if present) is calculated directly from the data itself, so you need look in only one place in the table to see if the item is there.

Example: if the word thing was calculated to have index 39, you need only look at location 39 in the table. If this location is empty, you know immediately that the word thing is not in the table, while if location 39 is occupied, it will be by the word thing and its associated count will be found at that location.

In practice, such clean access to the table is rarely found. This is because: You are usually trying to map data from a very large set (such as the 400,000 words in English) into a much smaller space (say, 1000 array elements). Since you don't know in advance what data items will occur, it is very difficult to find a transformation function which will map all the different data items that actually occur into different locations in the array.

Example: if the word thing maps into location 39, another word, such as computer, might also map into location 39. When this happens, a collision is said to occur. A method is needed to handle collisions in such a way that both these words can be stored in the array and, of course, can both be retrieved.

13.3 Hashing 13.3.1 Principles The process of mapping a large amount of data into a smaller table is called hashing. The function which provides the map between the original data and the smaller table in which it is finally stored is a hash function. The table itself is called a hash table.

Fig. 13.1. Mapping items into a hash table

Operations of the table data type are implemented in hashing as follows: 1. The hash function provides the map which translates the data D into the index I. 2. A new data item D is inserted into the table by using the hash function to calculate its index I. If this location is free, the item is inserted into the table. If not, a procedure for resolving the collision must be given.

3. An item of data D may be retrieved by using the hash function to calculate its index I. Position I in the table is checked. If it is empty, item D is not present. If it is occupied, its contents must be tested: if they match item D, then D has been found. If not, there are two possibilities: (i) item D is not present in the table; (ii) item D is present in the table, but when it was inserted the location at index I was already occupied, causing a collision. In either case, the procedure used for resolving collisions during insertion must be followed to see if D is located somewhere else in the table.

4. Item deletion proceeds in a similar manner to insertion. The hash function is called to determine the location of the item. If the item is present, it may then be deleted. If a collision occurred when the item was originally inserted, care must be taken in deleting it.

13.3.2 Choosing a hash function A good hash function should satisfy two criteria: 1. It should be quick to compute. 2. It should minimize the number of collisions.

1. Speed of computation The hash function should be simple, and should minimize time-consuming operations such as multiplication, division, or more complex functions such as square roots. Speed is important because the hash function is used every time the table is accessed.

2. Minimization of collisions A hash function should spread the incoming data as evenly as possible over the hash table. Example: a bad hash function in the case of counting words: suppose we have a hash table of 1000 elements, and we choose a hash function that takes the ASCII code of the first character in the word and uses that as an array index. This method would provide only 26 different indexes, so that 974 sites in the table are not directly accessible by the hash function. Any two words beginning with the same letter would result in a collision.

Examples of commonly used hash functions 1. Truncation. 2. Folding. 3. Modular arithmetic.

1. Truncation Part of the key is ignored, and the remainder (possibly after concatenation) is used as the index. Example: if we are storing 7-digit phone numbers in a hash table with 1000 elements, we may ignore all but the 2nd, 4th, and 7th digits of the phone number, so that a number such as 731-3018 would be indexed at location 338. This method is quick, as it involves accessing only a few digits of the input data, but the number of collisions it produces depends on how uniform the input data are.

If the table contains phone numbers for people living within a small area, then the first three digits may be the same for all the numbers (731 in our example). In that case every index produced begins with 3 (the 2nd digit), so 900 of the 1000 locations in the table would remain unused. This problem could be solved by choosing the last three digits of the phone number instead. In general, you should consider what regularities may be present in the data before deciding on a hash function.

2. Folding The data are split into smaller chunks which are then folded together in some way. Example: a 7-digit phone number could be split into groups of 2, 2, and 3 digits, which are added together and truncated to produce an index in the range 000-999. For the number 731-3018, we produce the three numbers 73, 13, and 018, which sum to 104; this is used as the index. Another number such as 899-6989 splits into 89, 96, and 989, which sum to 1174. Since this number is larger than the highest allowed index in the hash table, we truncate it by keeping only the last three digits, giving an index of 174.

3. Modular arithmetic Convert the data into an integer (using truncation, folding, or some other method), divide by the size of the hash table, and take the remainder as the index (for example, by using the % operator in C++). Example: modular arithmetic is used in the second example under folding: the phone number 899-6989 produced the value 1174 under the folding procedure, so this number is taken modulo the hash table size (1000) to produce the final index of 174.

13.3.3 Collision resolution with open addressing There are two main ways in which collisions may be resolved: Open addressing: the amount of space available for storing data is fixed at compile time by declaring a fixed array for the hash table. Chaining: an array is also declared for the hash table, but each element in the array is a pointer to a linked list which holds all data with the same index.

In open addressing, when a collision occurs, another unoccupied location in the array must be found. The method of choosing an alternative location should be fast, and it should minimize the number of additional collisions that occur as more data are added to the table. Collision resolution methods in open addressing 1. Linear probing 2. Quadratic probing 3. Item-dependent probe distance 4. Pseudorandom number generator

Fig. 13.2 shows an item being inserted into a hash table using linear probing to resolve the collision. The item is mapped to location 4 by the hash function, but locations 4, 5, and 6 are already full, so the collision resolution method eventually places the item in location 7.

1. Linear probing If a collision occurs when inserting a new item into the table, we probe forward in the array, one step at a time, until an empty slot is found to store the new data item. When retrieving this data: Calculate the hash function. Test the location given by the index to see if the required data item is there. If not, examine each subsequent array element until the item is found, or until we encounter an empty site or have examined all locations in the table, at which point we know the item is not in the table. When using linear probing, the array is treated as circular: if the search runs past the end of the array, it continues at element 0.

The disadvantage of linear probing is that data tend to cluster around certain points in the table, leaving other parts of the table unused. This results in lengthy sequential searches through the table when retrieving data, and therefore the search efficiency is reduced.

How does clustering appear? Suppose a hash function distributes data uniformly over a hash table of size n, and the first element is inserted at location i. The next element that hashes to location i is placed in location i + 1, so site i + 1 has twice the chance of being filled by the second element as any other site in the table. If sites i and i + 1 are filled by the first two elements, then site i + 2 has three times the chance of any other site of being filled by the third element, and so on. Any empty site at the end of a sequence of filled sites will therefore receive any item that is hashed to one of the filled sites or to that site directly. The result is long chains, or clusters, which require long sequences of comparisons during retrieval and thus reduce the search efficiency.

2. Quadratic probing One way of reducing the clustering problem is to use a collision resolution function that depends on the index value and on the number of previous attempts made to resolve the collision. If a collision occurs at position i, locations i + 1², i + 2², i + 3², and so on, are tested until an empty site is found.

Although this method reduces clustering, it does not probe every site in the table. If the table size is a prime number, the maximum number of probed sites in a hash table of size n is (n+1)/2, so approximately half the table is probed. Example: if the table size is n = 11, then for an element mapped to location 0, the six sites 0, 1, 4, 9, 5 (16 mod 11), and 3 (25 mod 11) will be probed. The next location to be probed by the quadratic probing algorithm would be site 3 again (36 mod 11), and all further sites produced by this algorithm will be one of the six already visited.

For table sizes that are not prime, the number of distinct sites probed by the quadratic probing algorithm can be smaller or larger than (n+1)/2. Example: if the table size is n = 10, six sites are probed (sites 0, 1, 4, 9, 6, 5). For a table size that is a perfect square, few sites are probed. Example: if the table size is 16, only the four sites 0, 1, 4, and 9 are probed.

To maximize the number of sites probed for the same hash function value: avoid table sizes that are perfect squares (or are divisible by perfect squares), and choose the table size to be either a prime number or a product of two different prime numbers.

3. Item-dependent probe distance The data item itself is truncated, and the truncated form is used to calculate the probe increment. Example: the last digit of a phone number is used as the increment. 4. Pseudorandom number generator A pseudorandom number generator is used to generate a random increment. Such a generator uses a seed value to produce a sequence of integers that appear random but are actually calculated by a deterministic rule. Provided the same seed is used for successive runs, the same sequence of numbers is generated. As long as we keep track of the seed and of where we are in the sequence, we always know where to probe next.

13.3.4 Deleting elements from hash tables Deletion is difficult to do efficiently in a hash table where open addressing is used. The reason is that in any table where collisions occurred during the insertion of data, there is a chain of items with the same index. If we delete an item that is not at the end of its chain, we remove a link in the chain, disconnecting the elements beyond that link.

Example: suppose we have stored four items with the same index at sites i, j, k, l, and we wish to delete item j. First, we locate the item by using the hash function to calculate its index. This directs us to site i, where the first item with that index is stored. This is not the correct item, so we apply whatever collision resolution system we are using to locate the next site, which contains item j, the one we are looking for. If we delete j from that site, the site becomes empty. A subsequent search for items k or l starts by using the hash function to find their index, which again directs us to site i. Applying the collision resolution system leads us to the site formerly occupied by j. Since j has been deleted, we are confronted with an empty site, which is the signal that no more items of that index are present, so the search terminates with the conclusion that k and l are not in the table.

There are two solutions to this problem: 1. shifting the remaining items forward in the chain when an item is deleted, 2. using a special flag which marks a cell as deleted rather than just empty, so that searches continue through this cell to see if any more items with that index are present. However, both methods are rather slow and cumbersome.

13.3.5 Collision resolution with chaining The second method of resolving collisions involves using dynamic data allocation and linked lists. The hash table and associated hash function are defined in the usual manner, except that now the array is an array of pointers to linked lists, one list for each index.

In Fig. 13.3, the array of pointers is shown as the vertical column of boxes on the left, with each box labelled with its hash function value. When a data item maps to a particular location, an extra node is allocated and added to the corresponding list. Note that a chained hash table can store more data items than the number of cells in the table: in this case, seven items are stored in a table with six cells.

When an item is inserted: If no data are stored at an index site, the corresponding pointer is set to 0. If an item is to be inserted, the hash function is used to find the list to which the item is to be added, and the standard insertion procedures for a linked list are used to insert the item. If a collision occurs, we simply add another node to the end of this list at the corresponding index.

When an item is to be retrieved: we use the hash function to calculate its index, and look at the corresponding pointer. If the pointer is 0, the item is not present. If the pointer points to a list, that list is traversed sequentially to see if the desired item is present. With a properly designed hash function, none of these lists should contain more than a few items, so sequential search is an efficient way to search them.

Deletion of an item from a table: The hash function is called to determine the index of the item to be deleted. The linked list at that index is searched and, if the item is present, its node is spliced out of the list in the usual way. We need not worry about isolating other parts of the table.

The disadvantage of using chaining is that a linked list requires extra storage space for the pointers connecting the list elements.