UNIT III BALANCED SEARCH TREES AND INDEXING

Nancy Ellis
5 years ago

OBJECTIVE

The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions and finds in constant average time. The central data structure in this chapter is the hash table. We will:

- See several methods of implementing the hash table.
- Compare these methods analytically.
- Show numerous applications of hashing.
- Compare hash tables with binary search trees.

INTRODUCTION

Hash tables do not efficiently support tree operations that require any ordering information among the elements. Thus, operations such as find_min, find_max, and the printing of the entire table in sorted order in linear time are not supported.

AVL Trees

An AVL tree is a binary search tree with a balance condition.

A first candidate condition: every node must have left and right subtrees of the same height.

Only perfectly balanced trees of $2^k - 1$ nodes would satisfy this condition. Although this guarantees trees of small depth, the balance condition is too rigid.

AVL tree

An AVL tree is identical to a binary search tree, except that for every node in the tree, the heights of the left and right subtrees can differ by at most 1. Fig (P.111)

Height information is kept for each node.

Example: An AVL tree of height 9 with the fewest nodes (143). Fig (P.112)

The minimum number of nodes, S(h), in an AVL tree of height h is given by:

S(h) = S(h-1) + S(h-2) + 1, with S(0) = 1 and S(1) = 2

Balance condition violation

Let the node that must be rebalanced be α. Since any node has at most 2 children, and a height imbalance requires that the heights of α's 2 subtrees differ by 2, a violation might occur in 4 cases:

1. An insertion into the left subtree of the left child of α. (L-L)

2. An insertion into the right subtree of the left child of α. (L-R)
3. An insertion into the left subtree of the right child of α. (R-L)
4. An insertion into the right subtree of the right child of α. (R-R)

Cases 1 and 4 are fixed by a single rotation of the tree. Cases 2 and 3 are fixed by a double rotation of the tree.

4.1 Single Rotation

Single rotation that fixes case 1: Fig (P.113)

Node k2 violates the AVL balance property because its left subtree is 2 levels deeper than its right subtree. In the original tree k2 > k1, so k2 becomes the right child of k1 in the new tree. Subtree Y, which holds items that are between k1 and k2 in the original tree, can be placed as k2's left child in the new tree. The new height of the entire subtree is exactly the same as the height of the original subtree prior to the insertion that caused X to grow.

Example: Fig (P.113) (insert 6 into the original AVL tree on the left)

Single rotation that fixes case 4: Fig (P.114)
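
The two single rotations described above can be sketched in C roughly as follows. This is a minimal illustration, not the figure code the notes refer to: the struct layout, the function names, and the convention that an empty subtree has height -1 are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical node layout: each node stores the height of its own subtree. */
struct AvlNode {
    int key;
    int height;                 /* a leaf has height 0 */
    struct AvlNode *left;
    struct AvlNode *right;
};

/* Assumed convention: an empty subtree has height -1. */
static int height(const struct AvlNode *n) { return n ? n->height : -1; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Case 1 (L-L): k2's left child k1 becomes the new root of the subtree. */
struct AvlNode *single_rotate_with_left(struct AvlNode *k2) {
    struct AvlNode *k1 = k2->left;
    k2->left = k1->right;       /* subtree Y becomes k2's left child */
    k1->right = k2;             /* k2 becomes k1's right child */
    k2->height = max_int(height(k2->left), height(k2->right)) + 1;
    k1->height = max_int(height(k1->left), k2->height) + 1;
    return k1;
}

/* Case 4 (R-R): the mirror image of case 1. */
struct AvlNode *single_rotate_with_right(struct AvlNode *k1) {
    struct AvlNode *k2 = k1->right;
    k1->right = k2->left;
    k2->left = k1;
    k1->height = max_int(height(k1->left), height(k1->right)) + 1;
    k2->height = max_int(k1->height, height(k2->right)) + 1;
    return k2;
}
```

Each function returns the new root of the rotated subtree, so the caller is expected to store the result back into the parent's child pointer.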

Example

Insert the keys 3, 2, 1 and then 4 through 7 in sequential order into an initially empty AVL tree.

4.2 Double Rotation

Diagrams (P )

Single rotation does not work for cases 2 or 3. Fig (P.116)

The fact that subtree Y in figure 4.34 has had an item inserted into it guarantees that it is nonempty. Thus, assume that it has a root and two subtrees. Consequently, the tree may be viewed as four subtrees connected by three nodes. (Left part of Fig. 4.35, P.116)

Exactly one of tree B or C is two levels deeper than D (unless all are empty).

Left-right double rotation to fix case 2: Fig (P.116)
Rotate between α's child and grandchild, and then between α and its new child. This restores the height to what it was before the insertion.

Right-left double rotation to fix case 3: Fig (P.116)

Example: Continue the example in 4.1 by inserting the keys 10 through 16 in reverse order, followed by 8 and then 9. Diagrams in P

To insert a new node with key X into an AVL tree T:

- Recursively insert X into the appropriate subtree of T (say T_LR).

- If the height of T_LR does not change, then we are done.
- Otherwise, if a height imbalance appears in T, do the appropriate single or double rotation depending on X and the keys in T and T_LR, and update the heights.

Storage of height information:

- As the difference in height (i.e. +1, 0, -1):
  - Requires only 2 bits.
  - Avoids repetitive calculation of balance factors.
  - Loss of clarity: the coding is more complicated than if the height were stored at each node.
- As absolute heights. Fig (P )

6: PRIORITY QUEUES (HEAPS)

1. Model

At least two operations:

- Insert
- DeleteMin: find, return and remove the minimum element in the priority queue.

Fig. 6.1 (P.178)

2. Simple Implementation

Simple linked list

- Insertions are performed at the front in O(1).
- Deleting the minimum requires traversing the list, which takes O(N) time.

If the list is kept sorted:

- Insertions take O(N) time.
- DeleteMins take O(1) time.

Binary search tree

- This gives an O(log N) average running time for both operations.

3. Binary Heap

The binary heap is the common implementation of priority queues. It has two properties:

- A structure property
- A heap order property

3.1 Structure Property

A heap is a binary tree that is completely filled, with the possible exception of the bottom level, which is filled from left to right. Such a tree is called a complete binary tree.

A complete binary tree of height h has between $2^h$ and $2^{h+1} - 1$ nodes. The height of a complete binary tree is $\lfloor \log N \rfloor$.

Because a complete binary tree is so regular, it can be represented in an array and no pointers are necessary. Fig (P.179)

For any element in array position i:

- The left child is in position 2i.
- The right child is in position 2i + 1.
- The parent is in position $\lfloor i/2 \rfloor$.
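
The index arithmetic above translates directly into code. The sketch below assumes the 1-based layout used in the text (position 0 is not a real element); the helper names are chosen for illustration.

```c
#include <stddef.h>

/* 1-based array positions, as in the text: the root is at position 1. */
static size_t left_child(size_t i)  { return 2 * i; }
static size_t right_child(size_t i) { return 2 * i + 1; }

/* Integer division rounds down, which gives floor(i / 2). */
static size_t parent(size_t i)      { return i / 2; }
```

For example, the children of position 3 are positions 6 and 7, and both of those have parent 3.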

Problem: An estimate of the maximum heap size is required in advance.

A heap data structure consists of:

- An array.
- An integer representing the maximum heap size.
- An integer representing the current heap size.

Fig. 6.4 (P )

3.2 Heap Order Property

For every node X, the key in the parent of X is smaller than (or equal to) the key in X, with the exception of the root (which has no parent). Fig. 6.5 (P.181)

The minimum element can always be found at the root, so FindMin takes constant time.

3.3 Basic Heap Operations

Insert

To insert an element X into the heap:

(1) Create a hole in the next available location.
(2) If X can be placed in the hole without violating heap order, then we are finished.
(3) Otherwise, slide the element that is in the hole's parent node into the hole, thus bubbling the hole up toward the root.
(4) Continue this process until X can be placed in the hole.

Percolate up: the new element is percolated up the heap until the correct position is found. Fig (P.183)
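
The insertion steps above can be sketched as follows. This is an illustrative version under stated assumptions, not the routine from the referenced figure: the `Heap` struct, the fixed `HEAP_CAPACITY`, and the explicit `hole > 1` test (used here instead of a sentinel in position 0) are choices made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_CAPACITY 64                 /* hypothetical maximum heap size */

struct Heap {
    int elements[HEAP_CAPACITY + 1];     /* positions 1..size are in use */
    size_t size;                         /* current heap size */
};

/* Insert x by percolating a hole up from the next free position. */
void heap_insert(struct Heap *h, int x) {
    assert(h->size < HEAP_CAPACITY);     /* caller must not overfill the heap */
    size_t hole = ++h->size;             /* (1) create a hole at the end */
    while (hole > 1 && h->elements[hole / 2] > x) {
        h->elements[hole] = h->elements[hole / 2];  /* (3) slide parent down */
        hole /= 2;                                  /* hole moves up one level */
    }
    h->elements[hole] = x;               /* (2)/(4) x fits here */
}
```

Moving the hole rather than swapping at each level means each step costs one assignment instead of three.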

If the element to be inserted is the new minimum, it will be pushed all the way to the top. In figure 6.8, a very small value (a sentinel) is placed in position 0 in order to make the loop terminate.

The time to do the insertion could be as much as O(log N).

DeleteMin

When the minimum is removed, a hole is created at the root. The last element X in the heap must move somewhere in the heap.

Algorithm:

(1) If X can be placed in the hole, then we are done (this is unlikely).
(2) Otherwise, slide the smaller of the hole's children into the hole, thus pushing the hole down one level.
(3) Repeat this step until X can be placed in the hole.

This is called percolate down. Fig (P )

In figure 6.12, line 8 tests if there are two children.

The worst-case running time is O(log N).

3.4 Other Heap Operations

A heap has very little ordering information, so there is no way to find any particular key without a linear scan through the entire heap. The only information known about the maximum element is that it is at one of the leaves. Fig (P.183)
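
The DeleteMin algorithm from section 3.3 can be sketched along the following lines. As before, this is a hedged illustration: the `Heap` struct and `HEAP_CAPACITY` are assumed names, and the `child < h->size` test plays the role the text attributes to the two-children check.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_CAPACITY 64                 /* hypothetical maximum heap size */

struct Heap {
    int elements[HEAP_CAPACITY + 1];     /* positions 1..size are in use */
    size_t size;
};

/* Remove and return the minimum by percolating a hole down from the root. */
int heap_delete_min(struct Heap *h) {
    assert(h->size > 0);                 /* heap must not be empty */
    int min = h->elements[1];
    int last = h->elements[h->size--];   /* element X that must be re-placed */
    size_t hole = 1;
    while (2 * hole <= h->size) {        /* hole has at least one child */
        size_t child = 2 * hole;
        /* If there are two children, pick the smaller one. */
        if (child < h->size && h->elements[child + 1] < h->elements[child])
            child++;
        if (h->elements[child] >= last)
            break;                       /* X can be placed in the hole */
        h->elements[hole] = h->elements[child];  /* slide child up */
        hole = child;
    }
    h->elements[hole] = last;
    return min;
}
```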

DecreaseKey

DecreaseKey(P, $\Delta$, H): Lower the value of the key at position P by a positive amount $\Delta$. Since this might violate the heap order, it must be fixed by a percolate up.

IncreaseKey

IncreaseKey(P, $\Delta$, H): Increase the value of the key at position P by a positive amount $\Delta$. This is done with a percolate down.

Delete

Delete(P, H): Remove the node at position P from the heap. This is done by first performing DecreaseKey(P, $\infty$, H) and then performing DeleteMin(H).

BuildHeap

BuildHeap(H): Take as input N keys and place them into an empty heap.

Algorithm:

(1) Place the N keys into the tree in any order, maintaining the structure property.
(2) Perform the algorithm in figure 6.14 to create a heap-ordered tree.
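
A common way to carry out step (2) is to percolate down every non-leaf position, from position N/2 back to the root. The sketch below assumes that is what the referenced figure does; the function names and the 1-based array layout are assumptions for illustration.

```c
#include <stddef.h>

/* Percolate the element at position hole down within a[1..n]. */
static void percolate_down(int a[], size_t n, size_t hole) {
    int x = a[hole];
    while (2 * hole <= n) {
        size_t child = 2 * hole;
        if (child < n && a[child + 1] < a[child])
            child++;                     /* choose the smaller child */
        if (a[child] >= x)
            break;
        a[hole] = a[child];
        hole = child;
    }
    a[hole] = x;
}

/* Turn a[1..n] (keys in any order) into a heap-ordered array. */
void build_heap(int a[], size_t n) {
    for (size_t i = n / 2; i >= 1; i--)  /* positions above n/2 are leaves */
        percolate_down(a, n, i);
}
```

Leaves already satisfy heap order trivially, which is why the loop can start at position N/2.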

4. Applications of Priority Queues

4.1 The Selection Problem

Fig (P )

Problem: find the kth largest element from an input list of N elements.

Algorithm 6A

Finds the kth smallest element.

(1) Read the N elements into an array.
(2) Apply the BuildHeap algorithm to the array.
(3) Perform k DeleteMin operations.
(4) The last element extracted from the heap is the answer.

Total running time: O(N + k log N), where

- O(N) is the time to construct the heap, and
- O(log N) is the time for each DeleteMin (there are k DeleteMins).

If this algorithm is run with k = N and the values are recorded as they leave the heap, the input list is sorted in O(N log N) time.

Algorithm 6B

Finds the kth largest element.

(1) At any point in time, maintain a set S of the k largest elements.
(2) After the first k elements are read, each new element is compared with the kth largest element, S_k. (S_k is the smallest element in S.)
(3) If the new element is larger, then it replaces S_k in S.
(4) At the end of the input, return the smallest element in S as the answer.
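
One way to realize Algorithm 6B is to keep S as a min-heap of size k, so that S_k is always at the root. The sketch below works under that assumption; the function names, the caller-provided scratch array `s` (with room for k + 1 ints, position 0 unused), and the requirement 1 <= k <= n are all choices made for the example.

```c
#include <stddef.h>

/* Percolate the element at position hole down within the min-heap s[1..k]. */
static void sift_down(int s[], size_t k, size_t hole) {
    int x = s[hole];
    while (2 * hole <= k) {
        size_t child = 2 * hole;
        if (child < k && s[child + 1] < s[child])
            child++;
        if (s[child] >= x)
            break;
        s[hole] = s[child];
        hole = child;
    }
    s[hole] = x;
}

/* Return the kth largest of in[0..n-1]. Requires 1 <= k <= n. */
int kth_largest(const int in[], size_t n, size_t k, int s[]) {
    for (size_t i = 0; i < k; i++)       /* read the first k elements into S */
        s[i + 1] = in[i];
    for (size_t i = k / 2; i >= 1; i--)  /* build a min-heap on S */
        sift_down(s, k, i);
    for (size_t i = k; i < n; i++) {
        if (in[i] > s[1]) {              /* larger than S_k, the root */
            s[1] = in[i];                /* replace S_k ... */
            sift_down(s, k, 1);          /* ... and restore heap order */
        }
    }
    return s[1];                         /* smallest element in S */
}
```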

Use a heap to implement S. Total time is O(k + (N - k) log k) = O(N log k).

4.2 Event Simulation

A bank waiting queue simulation consists of processing two kinds of events:

- A customer arriving.
- A customer departing, thus freeing up a teller.

Probability functions are used to generate an input stream consisting of ordered pairs of arrival time and service time for each customer, sorted by arrival time.

The simulation repeatedly finds the event that happens nearest in the future and processes that event.

- If the event is a departure: gather statistics for the departing customer and check the queue to see whether there is another customer waiting. If so, add that customer, compute the time when that customer will leave, and add that departure to the set of events waiting to happen.
- If the event is an arrival: check for an available teller. If there is none, place the arrival on the queue. Otherwise, give the customer a teller, compute the customer's departure time, and add the departure to the set of events waiting to happen.

The waiting line for customers can be implemented as a queue. Since we need to find the event nearest in the future, it is appropriate that the set of departures waiting to happen be organised in a priority queue.

5. d-heaps

A d-heap is exactly like a binary heap except that all nodes have d children. Fig (P.192)

A d-heap is much shallower than a binary heap, improving the running time of Inserts to $O(\log_d N)$.

For large d, the DeleteMin operation is more expensive, because even though the tree is shallower, the minimum of d children must be found, which takes d - 1 comparisons. The time for this operation rises to $O(d \log_d N)$.

Multiplications and divisions to find children and parents are now by d, which, unless d is a power of 2, seriously increases the running time, because multiplication and division can no longer be implemented by a bit shift.

The two main weaknesses of the heap implementation:

- Inability to perform Finds.
- Merging is a hard operation.

OBJECTIVE TYPE QUESTIONS

1. In a balanced search tree, every node must have left and right subtrees of the same ____
   a) Height b) Sibling c) Child d) Path
2. A balanced tree has ____ types of rotation
   a) One b) Two c) Three d) Four
3. Priority queues have at least two operations, DeleteMin and ____
   a) Insert b) Update c) Create d) DeleteMax
4. In a heap, when the minimum is removed, a hole is created at the ____
   a) Root b) Child c) Leaf d) Sibling
5. Priority queues are also known as ____
   a) Heap b) Balanced tree c) Search tree d) Stack
6. Binary heap implementations have ____ properties.
   a) One b) Two c) Three d) Four
7. A heap data structure consists of ____
   a) One b) Two c) Three d) Four

7. B-TREES

A B-tree of order M is a tree with the following properties:

- The root is either a leaf or has between 2 and M children.
- All nonleaf nodes (except the root) have between $\lceil M/2 \rceil$ and M children.
- All leaves are at the same depth.

The actual data are stored at the leaves.

Each interior node stores pointers $P_1, P_2, \ldots, P_M$ to the children and values $K_1, K_2, \ldots, K_{M-1}$, representing the smallest key found in the subtrees $P_2, P_3, \ldots, P_M$, respectively.

For every node, all the keys in subtree $P_1$ are smaller than the keys in subtree $P_2$, and so on.

Fig (P.134) (an example of a B-tree of order 4)

Insertion operation of B-trees (order 3)

Diagrams (P )

The keys in the leaves are ordered.

When attempting to add a fourth key to a leaf, instead of splitting the node into two, we can first attempt to find a sibling with only two keys. This strategy can also be applied to internal nodes and tends to keep more nodes full.

- Cost: more complicated routines.
- Benefit: less space tends to be wasted.

Deletion operation

Find the key to be deleted and remove it. If this key was one of only two keys in a node, then its removal leaves only one key. In that case, combine this node with a sibling:

- If the sibling has 3 keys, steal one so that both nodes have 2 keys.
- If the sibling has 2 keys, combine the 2 nodes into a single node with 3 keys.

Remember to update the information kept at the internal nodes.

The depth of a B-tree is at most $\lceil \log_{\lceil M/2 \rceil} N \rceil$.

At each node on the path, perform O(log M) work to determine which branch to take.

The worst-case running time for each of the Insert and Delete operations is $O(M \log_M N)$. Find takes O(log N).

Real use of B-trees

B-trees are used in database systems, where the tree is kept on a physical disk instead of main memory. Accessing a disk is typically slower than any main memory operation. The number of disk accesses is $O(\log_M N)$.

CHAPTER 5 HASHING

1. GENERAL IDEA

The ideal hash table data structure is merely an array of some fixed size, containing the keys. Each key is mapped into some number in the range 0 to TableSize - 1 and placed in the appropriate cell. This mapping is called the hash function.

A hash function should:

- Be simple to compute.
- Ensure that any two distinct keys get different cells.

Fig. 5.1 (P.150)

2. HASH FUNCTION

2.1 Hash function (keys are integers)

Key mod TableSize

The above hash function is reasonable unless Key happens to have some undesirable properties. If the table size is prime and the input keys are random integers, then this function:

- Is simple to compute.
- Distributes the keys evenly.

2.2 Hash function (keys are strings)

Function 1: Add up the ASCII values of the characters in the string. Fig (P )

If the table size is large, the function does not distribute the keys well.

Function 2: Assumption: the Key has at least 2 characters plus the NULL terminator. Fig. 5.4 (P.151)

The value 27 represents the number of letters in the English alphabet, plus the blank, and 729 is $27^2$. This function examines only the first three characters.

Drawback: English is not random. Only 28 percent of the table can actually be hashed to (assume the table size is 10,007).

Function 3: This function involves all characters in the key and can be expected to distribute well. Fig. 5.5 (P.152)

It computes

$$\sum_{i=0}^{KeySize-1} Key[KeySize - i - 1] \cdot 32^i$$

The value of 32 is used instead of 27, because multiplication by 32 is not really a multiplication, but amounts to bit-shifting by 5. The addition could be replaced with a bitwise exclusive OR, for increased speed.

If the keys are long, the hash function takes too long to compute. (Solution: do not use all the characters.)
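
Function 3 can be evaluated left to right with Horner's rule, so no explicit powers of 32 are needed. The following is a minimal sketch, not the code from the referenced figure; the function name `hash_string` and the use of `unsigned int` (where overflow simply wraps around) are assumptions for illustration.

```c
/* Computes the sum of Key[KeySize - i - 1] * 32^i, reduced mod table_size. */
unsigned int hash_string(const char *key, unsigned int table_size) {
    unsigned int hash_val = 0;
    while (*key != '\0') {
        /* Shifting left by 5 multiplies the running value by 32. */
        hash_val = (hash_val << 5) + (unsigned char)*key;
        key++;
    }
    return hash_val % table_size;
}
```

For the two-character key "ab" this yields (97 * 32 + 98) mod TableSize, since 'a' is 97 and 'b' is 98 in ASCII.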

The length and properties of the keys would influence the choice.

Collision: when an element is inserted, it hashes to the same value as an already inserted element.

3. SEPARATE CHAINING

Keep a list of all elements that hash to the same value. Fig. 5.6 (P.153)

To perform a Find:

- Use the hash function to determine which list to traverse.
- Then, traverse the list and return the position where the item is found.

To perform an Insert:

- Traverse down the appropriate list to check whether the element is already in place.
- If there is no duplicate, the new element is inserted at the front of the list or at the end of the list.
- If duplicates are expected, an extra field is usually kept to record the number of matches.

Fig (P )

In figure 5.8, line 12 is inefficient because the malloc is performed H->TableSize times. This can be avoided by replacing it with one call to malloc before the loop, as below:

H->TheLists = malloc(H->TableSize * sizeof(struct ListNode));

Fig (P.156)

If the ElementType is a string, comparison and assignment must be done with strcmp and strcpy.
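
The Find and Insert procedures above can be sketched as follows. This is an illustrative version, not the code from the referenced figures: it uses integer keys, an array of plain list pointers without header nodes, insertion at the front, and hypothetical names (`table_create`, `table_find`, `table_insert`). Cleanup code is omitted for brevity.

```c
#include <stdlib.h>

struct ListNode {
    int key;
    struct ListNode *next;
};

struct HashTable {
    size_t table_size;
    struct ListNode **lists;     /* one list head per cell, NULL if empty */
};

struct HashTable *table_create(size_t table_size) {
    struct HashTable *h = malloc(sizeof *h);
    if (h == NULL) return NULL;
    /* A single allocation for all list heads, zero-initialized. */
    h->lists = calloc(table_size, sizeof *h->lists);
    if (h->lists == NULL) { free(h); return NULL; }
    h->table_size = table_size;
    return h;
}

static size_t hash_int(int key, size_t table_size) {
    return (size_t)(unsigned int)key % table_size;
}

/* Return the node holding key, or NULL if it is not in the table. */
struct ListNode *table_find(const struct HashTable *h, int key) {
    struct ListNode *p = h->lists[hash_int(key, h->table_size)];
    while (p != NULL && p->key != key)
        p = p->next;
    return p;
}

/* Return 1 if inserted, 0 if already present, -1 on allocation failure. */
int table_insert(struct HashTable *h, int key) {
    size_t idx = hash_int(key, h->table_size);   /* hash computed once */
    for (struct ListNode *p = h->lists[idx]; p != NULL; p = p->next)
        if (p->key == key) return 0;
    struct ListNode *node = malloc(sizeof *node);
    if (node == NULL) return -1;
    node->key = key;
    node->next = h->lists[idx];                  /* insert at the front */
    h->lists[idx] = node;
    return 1;
}
```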

The insertion routine in figure 5.10 is poorly coded because it computes the hash function twice.

If the table is large and the hash function is good, all the lists should be short.

Load factor ($\lambda$): the ratio of the number of elements in the hash table to the table size.

The effort required to perform a search is the constant time required to evaluate the hash function plus the time to traverse the list.

The general rule is to make the table size about as large as the number of elements expected (i.e. let $\lambda \approx 1$), and to keep the table size prime to ensure a good distribution.

Disadvantages:

- Requires pointers (this slows the algorithm down a bit).
- Requires the implementation of a second data structure (the list).

4. OPEN ADDRESSING

If a collision occurs, alternative cells are tried until an empty cell is found.

Cells $h_0(X), h_1(X), h_2(X), \ldots$ are tried in succession, where $h_i(X) = (Hash(X) + F(i)) \bmod TableSize$, with $F(0) = 0$.

The function F is the collision resolution strategy.

Because all the data go inside the table, a bigger table is needed for open addressing hashing than for separate chaining hashing. Generally, the load factor should be below 0.5.

4.1 LINEAR PROBING

F is a linear function of i, typically F(i) = i. This amounts to trying cells sequentially (with wraparound) in search of an empty cell.

Fig (P.158) (inserting keys 89, 18, 49, 58, 69)

As long as the table is big enough, a free cell can always be found, but the time to do so can get quite large.

Primary clustering

Any key that hashes into the cluster will require several attempts to resolve the collision, and then it will add to the cluster. This happens even if the table is relatively empty.

The expected number of probes is roughly $\frac{1}{2}\left(1 + \frac{1}{(1 - \lambda)^2}\right)$ for insertions and unsuccessful searches, and $\frac{1}{2}\left(1 + \frac{1}{1 - \lambda}\right)$ for successful searches.

Random collision resolution

Assumptions: a very large table, and each probe is independent of the previous probes.

The expected number of probes in an unsuccessful search is $\frac{1}{1 - \lambda}$.

Since $\lambda$ changes from 0 to its current value, earlier insertions are cheaper, and accessing an element inserted early should be easier than accessing a recently inserted element.

Fig (P.159) (compares the performance of linear probing (dashed curves) with random collision resolution)

Linear probing can be a bad idea if the table is expected to be more than half full.
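
Linear probing with F(i) = i can be sketched as below. The example assumes non-negative integer keys, the hash function Key mod TableSize, and a hypothetical `EMPTY` marker value of -1 for unused cells; none of these names come from the original figures.

```c
#define EMPTY (-1)   /* hypothetical marker for an unused cell */

/* Insert key using h_i(X) = (X mod table_size + i) mod table_size.
   Returns the cell used, or -1 if the table is full. */
int linear_probe_insert(int table[], int table_size, int key) {
    int home = key % table_size;
    for (int i = 0; i < table_size; i++) {
        int pos = (home + i) % table_size;   /* wraparound */
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

With a table of size 10, inserting 89, 18, 49, 58, 69 in that order places them in cells 9, 8, 0, 1 and 2: the last three keys all collide near the end of the table and wrap around, which is the clustering effect described above.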

4.2 Quadratic Probing

Quadratic probing eliminates the primary clustering problem of linear probing. The popular choice is $F(i) = i^2$.

Fig (P.160) (inserting keys 89, 18, 49, 58, 69)

There is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime. If the table size is not prime, the number of alternative locations can be severely reduced.

Standard deletion cannot be performed in an open addressing hash table, because the cell might have caused a collision to go past it. Lazy deletion is required.

Fig (P )

Find(Key, H) (fig. 5.16): Return the position of Key in the hash table. If Key is not present, then Find will return the last cell. (This cell is where the Key would be inserted if needed.)

Secondary clustering

Elements that hash to the same position will probe the same alternative cells. Double hashing eliminates this, at the cost of extra multiplications and divisions.

4.3 Double Hashing

One popular choice is $F(i) = i \cdot hash_2(X)$.

The $hash_2$ function must never evaluate to zero. The function below will work well:

$hash_2(X) = R - (X \bmod R)$, where R is a prime smaller than TableSize.

Example: Choose R = 7 and insert the keys 89, 18, 49, 58, 69. Fig (P.165)

If the table size is not prime, it is possible to run out of alternative locations prematurely.

5. REHASHING

If the table gets too full, the operations will start taking too long, and inserts might fail for open addressing hashing with quadratic resolution.

Rehashing

(1) Build another table that is about twice as big (with an associated new hash function).
(2) Scan down the entire original hash table.
(3) Compute the new hash value for each (nondeleted) element.
(4) Insert the element in the new table.

Example:

Fig (P.166) (Elements 13, 15, 6, and 24 are inserted. The hash function is h(X) = X mod 7. Linear probing is used to resolve collisions.)

Fig (P.166) (Element 23 is inserted; the table is now over 70% full.)

Fig (P.167)

(Table size is 17 because this is the first prime that is twice as large as the old table size. h(X) = X mod 17.)

Rehashing is a very expensive operation. The running time is O(N). However, there must have been N/2 Inserts prior to the last rehash.

Rehashing can be implemented in several ways with quadratic probing:

- Rehash as soon as the table is half full.
- Rehash only when an insertion fails.
- Rehash when the table reaches a certain load factor.

Rehashing frees the programmer from worrying about the table size and is important because hash tables cannot be made arbitrarily large in complex programs.

Fig (P.168)
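
The four rehashing steps can be sketched as follows for an open addressing table with linear probing. This is an illustration under stated assumptions: non-negative integer keys, the hash function X mod TableSize, a hypothetical `EMPTY` marker of -1, and no lazy-deletion marks (so every non-empty cell counts as a nondeleted element). The helper names are invented for the example.

```c
#include <stdlib.h>

#define EMPTY (-1)   /* hypothetical marker for an unused cell */

static int is_prime(int n) {
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Smallest prime that is >= n. */
static int next_prime(int n) {
    while (!is_prime(n)) n++;
    return n;
}

/* Linear probing insert; the new table always has a free cell here. */
static void probe_insert(int table[], int size, int key) {
    int pos = key % size;
    while (table[pos] != EMPTY)
        pos = (pos + 1) % size;
    table[pos] = key;
}

/* Build a table about twice as big and reinsert every element.
   Returns the new table (caller frees it) or NULL on allocation failure. */
int *rehash(const int old_table[], int old_size, int *new_size) {
    int size = next_prime(2 * old_size);             /* step (1) */
    int *table = malloc((size_t)size * sizeof *table);
    if (table == NULL) return NULL;
    for (int i = 0; i < size; i++) table[i] = EMPTY;
    for (int i = 0; i < old_size; i++)               /* step (2) */
        if (old_table[i] != EMPTY)
            probe_insert(table, size, old_table[i]); /* steps (3) and (4) */
    *new_size = size;
    return table;
}
```

Applied to the size-7 table from the example (after inserting 13, 15, 6, 24 and 23), this produces a table of size 17 with 6, 23 and 24 in cells 6, 7 and 8, 13 in cell 13, and 15 in cell 15.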

PART A (2 MARKS)

1. DEFINE AVL TREES.

An AVL tree is identical to a binary search tree, except that for every node in the tree, the heights of the left and right subtrees can differ by at most 1.

2. GIVE EXAMPLE FOR SINGLE ROTATION.

Fig (P.113) (insert 6 into the original AVL tree on the left)

3. GIVE EXAMPLE FOR DOUBLE ROTATION.

Continue the example in 4.1 by inserting the keys 10 through 16 in reverse order, followed by 8 and then 9. Diagrams in P

4. WRITE ANY TWO OPERATIONS FOR PRIORITY QUEUES (HEAPS).

- Insert
- DeleteMin: find, return and remove the minimum element in the priority queue.

5. DEFINE BINARY HEAP.

The binary heap is the common implementation of priority queues. It has two properties:

- A structure property
- A heap order property

6. DEFINE COMPLETE BINARY TREE.

A complete binary tree is a binary tree that is completely filled, with the possible exception of the bottom level, which is filled from left to right. A heap is such a tree.

7. WHAT IS THE REPRESENTATION OF HEAP DATA STRUCTURE.

- An array.
- An integer representing the maximum heap size.
- An integer representing the current heap size.

8. WHAT ARE THE OTHER HEAP OPERATIONS.

- DecreaseKey
- IncreaseKey

9. MENTION SOME APPLICATIONS OF PRIORITY QUEUES.

- The Selection Problem
- Event Simulation

10. DEFINE D-HEAPS.

A d-heap is exactly like a binary heap except that all nodes have d children. It is much shallower than a binary heap, improving the running time of Inserts to $O(\log_d N)$.

11. MENTION THE TWO MAIN WEAKNESSES OF HEAP IMPLEMENTATION.

- Inability to perform Finds.
- Merging is a hard operation.

12. WHAT ARE THE B-TREE OPERATIONS.

- Insertion operation of B-trees (order 3)
- Deletion operation

13. DEFINE HASHING.

The ideal hash table data structure is merely an array of some fixed size, containing the keys. Each key is mapped into some number in the range 0 to TableSize - 1 and placed in the appropriate cell. This mapping is called the hash function.

A hash function should:

- Be simple to compute.
- Ensure that any two distinct keys get different cells.

14. DEFINE DOUBLE HASHING.

One popular choice is $F(i) = i \cdot hash_2(X)$. The $hash_2$ function must never evaluate to zero. The function below will work well:

$hash_2(X) = R - (X \bmod R)$, where R is a prime smaller than TableSize.

15. WHAT IS THE DRAWBACK OF FUNCTION 2 IN HASH FUNCTION (KEYS ARE STRINGS).

English is not random. Only 28 percent of the table can actually be hashed to (assume the table size is 10,007).

16. WHAT ARE THE USES OF SEPARATE CHAINING.

- To perform a Find
- To perform an Insert

17. WHAT IS THE DISADVANTAGE OF INSERT PERFORMANCE IN SEPARATE CHAINING.

- Requires pointers (this slows the algorithm down a bit).
- Requires the implementation of a second data structure (the list).

18. DEFINE OPEN ADDRESSING.

If a collision occurs, alternative cells are tried until an empty cell is found. Cells $h_0(X), h_1(X), h_2(X), \ldots$ are tried in succession, where $h_i(X) = (Hash(X) + F(i)) \bmod TableSize$, with $F(0) = 0$.

19. WHAT ARE THE THREE COMMON COLLISION RESOLUTION STRATEGIES.

- Linear Probing
- Quadratic Probing
- Double Hashing

20. DEFINE REHASHING.

(1) Build another table that is about twice as big (with an associated new hash function).
(2) Scan down the entire original hash table.
(3) Compute the new hash value for each (nondeleted) element.
(4) Insert the element in the new table.

PART B (16 MARKS)

1. EXPLAIN THE OPERATION AND IMPLEMENTATION OF BINARY HEAP.

2. EXPLAIN THE IMPLEMENTATION OF DIFFERENT HASHING TECHNIQUES.

3. (a) HOW DO YOU INSERT AN ELEMENT IN A BINARY SEARCH TREE? (8)
   (b) SHOW THAT FOR THE PERFECT BINARY TREE OF HEIGHT H CONTAINING $2^{H+1} - 1$ NODES, THE SUM OF THE HEIGHTS OF THE NODES IS $2^{H+1} - 1 - (H + 1)$. (8)

4. GIVEN INPUT {4371, 1323, 6173, 4199, 4344, 9679, 1989} AND A HASH FUNCTION H(X) = X (MOD 10), SHOW THE RESULTING:
   (A) SEPARATE CHAINING TABLE (4)
   (B) OPEN ADDRESSING HASH TABLE USING LINEAR PROBING (4)
   (C) OPEN ADDRESSING HASH TABLE USING QUADRATIC PROBING (4)
   (D) OPEN ADDRESSING HASH TABLE WITH SECOND HASH FUNCTION H2(X) = 7 - (X MOD 7). (4)

5. EXPLAIN IN DETAIL (I) SINGLE ROTATION (II) DOUBLE ROTATION OF AN AVL TREE.

6. EXPLAIN THE EFFICIENT IMPLEMENTATION OF THE PRIORITY QUEUE ADT.

7. EXPLAIN HOW TO FIND A MAXIMUM ELEMENT AND MINIMUM ELEMENT IN BST. EXPLAIN IN DETAIL ABOUT DELETION IN BINARY SEARCH TREE.

OBJECTIVE TYPE QUESTIONS

1. In heap operations, DecreaseKey is done with a ____
   a) Percolate up b) Percolate down c) Max d) Min
2. In heap operations, IncreaseKey is done with a ____
   a) Percolate up b) Percolate down c) Max d) Min
3. The leaves of a B-tree all have the same ____
   a) Height b) Depth c) Path d) Child
4. In B-trees, the actual data are stored at the ____
   a) Child b) Leaves c) Sibling d) Root
5. In a hash function, the addition could be replaced with a bitwise exclusive ____ for increased speed.
   a) AND b) OR c) NOT d) NAND
6. The expensive operation in hashing is ____
   a) Hashing b) Rehashing c) Double Hashing
7. A good hash function distributes the keys ____
   a) Randomly b) Evenly c) Alternately
