DATA STRUCTURES/UNIT 3
- Stuart Porter
UNIT III  SORTING AND SEARCHING

General Background - Exchange Sorts - Selection and Tree Sorting - Insertion Sorts - Merge and Radix Sorts - Basic Search Techniques - Tree Searching - General Search Trees - Hashing

Introduction

Sorting and searching are fundamental operations in computer science. Sorting refers to the operation of arranging data in some given order. Searching refers to the operation of finding a particular record in the existing information. Information retrieval normally involves searching, sorting and merging. In this chapter we discuss searching and sorting techniques in detail.

After going through this unit you will be able to:

- Know the fundamentals of sorting techniques
- Know the different searching techniques
- Discuss the algorithms of internal sorting and external sorting
- Explain the difference between internal sorting and external sorting
- State the complexity of each sorting technique
- Discuss the algorithms of various searching techniques
- Discuss merge sort
- Discuss the algorithms of sequential search, binary search and binary tree search
- Analyze the performance of searching methods

SEARCHING

Searching refers to the operation of finding the location of a given item in a collection of items. The search is said to be successful if ITEM appears in DATA and unsuccessful otherwise. The following searching algorithms are discussed in this chapter:

1. Sequential search
2. Binary search

CCET/MCA Page 1
3. Binary tree search

Sequential Search

This is the most natural searching method. The most intuitive way to search for a given ITEM in DATA is to compare ITEM with each element of DATA one by one. The algorithm for a sequential search procedure is now presented.

Algorithm: SEQUENTIAL SEARCH
INPUT  : List of size N, target value T
OUTPUT : Position of T in the list, or -1 if T is not present

BEGIN
    Set FOUND := false
    Set I := 0
    While (I < N) and (FOUND is false)
        If List[I] = T then
            Set FOUND := true
        Else
            Set I := I + 1
    If FOUND is false then
        T is not present in the list (return -1)
    Else
        Return I
END

Binary Search

Suppose DATA is an array which is sorted in increasing numerical order. Then there is an extremely efficient searching algorithm, called binary search, which can be used to find the location LOC of a given ITEM of information in DATA.

The binary search algorithm applied to our array DATA works as follows. During each stage of the algorithm, our search for ITEM is reduced to a segment of elements of DATA:

    DATA[BEG], DATA[BEG + 1], DATA[BEG + 2], ..., DATA[END]
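As an illustrative sketch (not part of the original notes), the sequential search above might be written in Python; the function and variable names are our own, and indexing is 0-based:

```python
def sequential_search(data, target):
    """Return the position of target in data, or -1 if it is not present."""
    i = 0
    while i < len(data):          # While (I < N) and not FOUND
        if data[i] == target:
            return i              # successful search
        i += 1
    return -1                     # unsuccessful search

# Searching an unsorted list: every element may have to be examined.
print(sequential_search([44, 11, 30, 77], 30))  # prints 2
print(sequential_search([44, 11, 30, 77], 99))  # prints -1
```

In the worst case (the target is absent or last) all N elements are compared, so the running time is O(n).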
Note that the variables BEG and END denote the beginning and end locations of the segment respectively. The algorithm compares ITEM with the middle element DATA[MID] of the segment, where MID is obtained by

    MID = INT((BEG + END) / 2)

(We use INT(A) for the integer value of A.) If DATA[MID] = ITEM, then the search is successful and we set LOC := MID. Otherwise a new segment of DATA is obtained as follows:

(a) If ITEM < DATA[MID], then ITEM can appear only in the left half of the segment:

        DATA[BEG], DATA[BEG + 1], ..., DATA[MID - 1]

    So we reset END := MID - 1 and begin searching again.

(b) If ITEM > DATA[MID], then ITEM can appear only in the right half of the segment:

        DATA[MID + 1], DATA[MID + 2], ..., DATA[END]

    So we reset BEG := MID + 1 and begin searching again.

Initially, we begin with the entire array DATA; i.e. we begin with BEG = 1 and END = n. If ITEM is not in DATA, then eventually we obtain

    END < BEG

This condition signals that the search is unsuccessful, and in this case we assign LOC := NULL. Here NULL is a value that lies outside the set of indices of DATA. We now formally state the binary search algorithm.

Algorithm 2.9: (Binary Search) BINARY(DATA, LB, UB, ITEM, LOC)

Here DATA is a sorted array with lower bound LB and upper bound UB, and ITEM is a given item of information. The variables BEG, END and MID denote, respectively, the beginning, end and middle locations of a segment of elements of DATA. This algorithm finds the location LOC of ITEM in DATA or sets LOC := NULL.

1. [Initialize segment variables.]
   Set BEG := LB, END := UB and MID := INT((BEG + END) / 2).
2. Repeat Steps 3 and 4 while BEG <= END and DATA[MID] != ITEM.
3. If ITEM < DATA[MID], then:
        Set END := MID - 1.
   Else:
        Set BEG := MID + 1.
   [End of If structure.]
4. Set MID := INT((BEG + END) / 2).
   [End of Step 2 loop.]
5. If DATA[MID] = ITEM, then:
        Set LOC := MID.
   Else:
        Set LOC := NULL.
   [End of If structure.]
6. Exit.

Example 2.9

Let DATA be the following sorted 13-element array:

    DATA: 11, 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, 99

We apply the binary search to DATA for different values of ITEM.

(a) Suppose ITEM = 40. The search for ITEM in the array DATA is pictured below, where the values of DATA[BEG] and DATA[END] at each stage of the algorithm are indicated by parentheses and the value of DATA[MID] by square brackets. Specifically, BEG, END and MID will have the following successive values:

(1) Initially, BEG = 1 and END = 13. Hence, MID = INT[(1 + 13) / 2] = 7 and so DATA[MID] = 55.
(2) Since 40 < 55, END = MID - 1 = 6. Hence, MID = INT[(1 + 6) / 2] = 3 and so DATA[MID] = 30.
(3) Since 40 > 30, BEG = MID + 1 = 4. Hence, MID = INT[(4 + 6) / 2] = 5 and so DATA[MID] = 40.

The search is successful and LOC = MID = 5.

(1) (11), 22, 30, 33, 40, 44, [55], 60, 66, 77, 80, 88, (99)
(2) (11), 22, [30], 33, 40, (44), 55, 60, 66, 77, 80, 88, 99
(3) 11, 22, 30, (33), [40], (44), 55, 60, 66, 77, 80, 88, 99   [Successful]
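A runnable Python sketch of Algorithm 2.9, assuming 0-based indexing (so the text's LOC = 5 for ITEM = 40 corresponds to index 4 here) and using None in place of the text's NULL:

```python
def binary_search(data, item):
    """Return the index of item in the sorted list data, or None (NULL)."""
    beg, end = 0, len(data) - 1
    while beg <= end:
        mid = (beg + end) // 2       # MID = INT((BEG + END) / 2)
        if data[mid] == item:
            return mid               # successful search: LOC := MID
        elif item < data[mid]:
            end = mid - 1            # item can only be in the left half
        else:
            beg = mid + 1            # item can only be in the right half
    return None                      # END < BEG: unsuccessful search

data = [11, 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, 99]
print(binary_search(data, 40))       # prints 4 (0-based index)
print(binary_search(data, 41))       # prints None
```

Each pass halves the segment, which is where the floor(log2 n) + 1 comparison bound discussed next comes from.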
Complexity of the Binary Search Algorithm

The complexity is measured by the number of comparisons f(n) to locate ITEM in DATA, where DATA contains n elements. Observe that each comparison reduces the sample size by half. Hence we require at most f(n) comparisons to locate ITEM, where

    f(n) = floor(log2 n) + 1

That is, the running time for the worst case is approximately log2 n. The running time for the average case is approximately equal to the running time for the worst case.

Limitations of the Binary Search Algorithm

The algorithm requires two conditions: (1) the list must be sorted, and (2) one must have direct access to the middle element of any sublist.

Binary Search Tree

Suppose T is a binary tree. Then T is called a binary search tree if each node N of T has the following property: the value at N is greater than every value in the left subtree of N and is less than every value in the right subtree of N.

[Figure: Binary search tree T]

SEARCHING AND INSERTING IN BINARY SEARCH TREES

Suppose an ITEM of information is given. The following algorithm finds the location of ITEM in the binary search tree T, or inserts ITEM as a new node in its appropriate place in the tree.

(a) Compare ITEM with the root node N of the tree.
    (i) If ITEM < N, proceed to the left child of N.
    (ii) If ITEM > N, proceed to the right child of N.

(b) Repeat Step (a) until one of the following occurs:

    (i) We meet a node N such that ITEM = N. In this case the search is successful.
    (ii) We meet an empty subtree, which indicates that the search is unsuccessful, and we insert ITEM in place of the empty subtree.

In other words, proceed from the root R down through the tree T until finding ITEM in T or inserting ITEM as a terminal node in T.

Example 2.11

Consider the binary search tree T in Fig. Suppose ITEM = 20 is given.

1. Compare ITEM = 20 with the root, 38, of the tree T. Since 20 < 38, proceed to the left child of 38, which is 14.
2. Compare ITEM = 20 with 14. Since 20 > 14, proceed to the right child of 14, which is 23.
3. Compare ITEM = 20 with 23. Since 20 < 23, proceed to the left child of 23, which is 18.
4. Compare ITEM = 20 with 18. Since 20 > 18 and 18 does not have a right child, insert 20 as the right child of 18.

[Figure: ITEM = 20 inserted]

DELETING IN A BINARY SEARCH TREE

Suppose T is a binary search tree, and suppose an ITEM of information is given. This section gives an algorithm which deletes ITEM from the tree T.
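The search-and-insert procedure above can be sketched in Python. The values 38, 14, 23 and 18 come from Example 2.11; the remaining node values and all names are our own illustrative additions:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, item):
    """Search for item; if absent, insert it as a terminal node. Returns root."""
    if root is None:
        return Node(item)            # empty subtree: insert ITEM here
    if item < root.value:
        root.left = insert(root.left, item)
    elif item > root.value:
        root.right = insert(root.right, item)
    return root                      # item already present: tree unchanged

def search(root, item):
    """Return the node containing item, or None if the search fails."""
    while root is not None and root.value != item:
        root = root.left if item < root.value else root.right
    return root

# Build a tree rooted at 38 (extra values 56, 8, 45, 82 are illustrative).
root = None
for v in [38, 14, 56, 8, 23, 18, 45, 82]:
    root = insert(root, v)
root = insert(root, 20)              # ends up as the right child of 18
print(search(root, 20) is not None)  # prints True
```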
Let N denote the node of T that contains ITEM.

Case 1. N has no children. Then N is deleted from T by simply replacing the location of N in the parent node P(N) by the null pointer.

Case 2. N has exactly one child. Then N is deleted from T by simply replacing the location of N in P(N) by the location of the only child of N.

Case 3. N has two children. Let S(N) denote the inorder successor of N. (The reader can verify that S(N) does not have a left child.) Then N is deleted from T by first deleting S(N) from T (by using Case 1 or Case 2) and then replacing node N in T by the node S(N).

Observe that the third case is much more complicated than the first two cases. In all three cases, the memory space of the deleted node N is returned to the AVAIL list.

[Figure: (a) Before deletions. (b) Linked representation.]
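The three deletion cases might be implemented as follows. This is a hedged Python sketch with a minimal node type of our own; for Case 3 it copies the inorder successor's value into N and then deletes the successor, which is one common way to realize "replacing node N by S(N)":

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def delete(root, item):
    """Delete item from the BST rooted at root; return the new root."""
    if root is None:
        return None
    if item < root.value:
        root.left = delete(root.left, item)
    elif item > root.value:
        root.right = delete(root.right, item)
    else:
        # Case 1 (no children) and Case 2 (one child): splice N out.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Case 3 (two children): find inorder successor S(N) ...
        succ = root.right
        while succ.left is not None:   # S(N) has no left child
            succ = succ.left
        root.value = succ.value        # ... replace N's value by S(N)'s
        root.right = delete(root.right, succ.value)  # delete S(N) (Case 1/2)
    return root

# Illustrative tree (values are our own): delete the root, which has two children.
root = Node(50, Node(30, Node(20), Node(40)), Node(70, Node(60), Node(80)))
root = delete(root, 50)
print(root.value)                      # prints 60, the inorder successor of 50
```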
[Figure: (a) Node 44 is deleted. (b) Linked representation.]

Sorting Methods

The function of sorting, or ordering a list of objects according to some linear order, is so fundamental that it is ubiquitous in engineering applications in all disciplines. There are two broad categories of sorting methods:

- Internal sorting takes place in main memory, where we can take advantage of the random access nature of main memory.
- External sorting is necessary when the number and size of objects are prohibitive to be accommodated in main memory.

Given records r1, r2, ..., rn with key values k1, k2, ..., kn, sorting produces the records in the order ri1, ri2, ..., rin such that

    ki1 <= ki2 <= ... <= kin
The complexity of a sorting algorithm can be measured in terms of:

- the number of algorithm steps to sort n records
- the number of comparisons between keys (appropriate when the keys are long character strings)
- the number of times records must be moved (appropriate when record size is large)

Any sorting algorithm that uses comparisons of keys needs at least O(n log n) time to accomplish the sorting.

Sorting methods:

- Internal (in memory): quick sort, heap sort, bubble sort, insertion sort, selection sort, shell sort
- External (appropriate for secondary storage): merge sort, radix sort, polyphase sort

Insertion Sort

The general idea of the insertion sort method is that for each element, we find the slot where it belongs.

Example: The element in position Array[0] is certainly sorted. Thus, move on to insert the second character, D, into the appropriate location to maintain the alphabetical order.

How does it work? Each element Array[j] is taken one at a time from j = 0 to n-1. Before insertion of Array[j], the subarray from Array[0] to Array[j-1] is sorted, and the remainder of the array is not. After insertion, Array[0..j] is correctly ordered, while the subarray with elements Array[j+1]..Array[n-1] is unsorted.

Insertion Sort Algorithm
for i = 1 to n-1
    temp = a[i]
    loc = i
    while (loc > 0) and (a[loc-1] > temp)
        a[loc] = a[loc-1]
        loc = loc - 1
    a[loc] = temp

Insertion sort

The initial state is that the first element, considered by itself, is sorted. The final state is that all elements, considered as a group, are sorted. The basic action is to arrange the elements in positions 0 through i; in each stage i increases by 1, and the outer loop controls this. When the body of the outer for loop is entered, we know that the elements at positions 0 through i-1 are sorted, and we need to extend this to positions 0 through i; after the last stage, positions 0 to n-1 are sorted. At each step the element indexed by i needs to be added to the sorted part of the array. This is done by placing it in a temporary variable and sliding all elements larger than it one position to the right. Then the temporary element is copied into the vacated position; the counter loc indicates this position.

Complexity

Best case: the data is already sorted. The inner loop is never executed, and the outer loop is executed n - 1 times, for a total complexity of O(n).

Worst case: the data is in reverse order. The inner loop is executed the maximum number of times. Thus the complexity of the insertion sort in this worst possible case is quadratic, or O(n^2).

Selection Sort

In this sorting we find the smallest element in the list and put it in the first position. Then we find the second smallest element in the list and put it in the second position, and so on.

Pass 1. Find the location LOC of the smallest in the list of N elements A[1], A[2], ..., A[N], and then interchange A[LOC] and A[1]. Then A[1] is sorted.
Pass 2. Find the location LOC of the smallest in the sublist of N - 1 elements A[2], A[3], ..., A[N], and then interchange A[LOC] and A[2]. Then A[1], A[2] is sorted, since A[1] < A[2].

Pass 3. Find the location LOC of the smallest in the sublist of N - 2 elements A[3], A[4], ..., A[N], and then interchange A[LOC] and A[3]. Then A[1], A[2], A[3] is sorted, since A[2] < A[3].

...

Pass N - 1. Find the location LOC of the smaller of the elements A[N - 1], A[N], and then interchange A[LOC] and A[N - 1]. Then A[1], A[2], ..., A[N] is sorted, since A[N - 1] < A[N]. Thus A is sorted after N - 1 passes.

Hashing

Accessing elements in an array is extremely efficient: array elements are accessed by index. If we can find a mapping between the search keys and indices, we can store each record in the element with the corresponding index. Each element would then be found with one operation only.

Advantage: the records can be referenced directly; ideally the search time is a constant, complexity O(1).

Question: how do we find such a correspondence?

Answers: direct-address tables, hash tables.

Direct-address tables

Direct-address tables are the most elementary form of hashing. The assumption is a direct one-to-one correspondence between the keys and the numbers 0, 1, ..., m-1, with m not very large. Searching is fast, but there is a cost: the size of the array we need is the size of the largest key. This is not very useful if only a few keys are widely distributed.
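A direct-address table as just described can be sketched in Python; the class name and the choice of None as the "empty" sentinel are our own:

```python
class DirectAddressTable:
    """Direct addressing: the record with key k (0 <= k < m) lives in slot k."""

    def __init__(self, m):
        self.slots = [None] * m    # table size = size of the key universe

    def insert(self, key, value):
        self.slots[key] = value    # one operation: O(1)

    def search(self, key):
        return self.slots[key]     # O(1); None means the key is absent

    def delete(self, key):
        self.slots[key] = None     # O(1)

# A table for keys 0..99: fast, but 100 slots exist even if few are used.
t = DirectAddressTable(100)
t.insert(42, "record for key 42")
print(t.search(42))                # prints: record for key 42
```

Note the cost the text mentions: the array must be as large as the largest possible key, even when only a few keys actually occur.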
Hash functions

A hash function is a function that transforms the search key into a table address. Hash functions transform the keys into numbers within a predetermined interval. These numbers are then used as indices in an array (table, hash table) to store the records.

Keys that are numbers: if M is the size of the array, then h(key) = key % M. This will map all the keys into numbers within the interval [0, M-1].

Keys that are strings of characters: treat the binary representation of the key as a number, and then apply the first case. How keys are treated as numbers: if each character is represented with m bits, then the string can be treated as a base-2^m number.

Hash tables: basic concepts

Once we have found the method of mapping keys to indices, the questions to be solved are how to choose the size of the table (array) to store the records, and how to perform the basic operations:

- insert
- search
- delete

Let N be the number of records to be stored, and M the size of the array (hash table). The integer between 0 and M-1 generated by a hash function is used as an index in a hash table of M elements. Initially all slots in the table are blank; this is shown either by a sentinel value or by a special field in each slot.

To insert, use the hash function to generate an address for each value to be inserted.
To search for a key in the table, the same hash function is used.

To delete a record with a given key, we first apply the search method, and when the key is found we delete the record.

Size of the table: ideally we would like to store N records in a table of size N. However, in many cases we don't know in advance the exact number of records. Also, the hash function can map two keys to one and the same index, so some cells in the array will not be used. Hence we assume that the size of the table can be different from the number of records. We use M to denote the size of the table.

A characteristic of the hash table is its load factor L = N/M: the ratio between the number of records to be stored and the size of the table. The method used to choose the size of the table depends on the chosen method of collision resolution, discussed below. M should be a prime number: it has been proved that if M is a prime number, we obtain a better (more even) distribution of the keys over the table.

Collision resolution

A collision is the case when two or more keys hash to one and the same index in the hash table. Collision resolution deals with keys that are mapped to the same indices. Methods:

- Separate chaining
- Open addressing
  - Linear probing
  - Quadratic probing
  - Double hashing
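The hash functions described above can be sketched in Python. M = 11 is an arbitrary prime of our own choosing, and the function names are ours; note how two different keys can already collide, which is what the methods below resolve:

```python
M = 11  # table size; a prime gives a more even distribution of keys

def hash_int(key):
    """h(key) = key % M, mapping any integer key into [0, M-1]."""
    return key % M

def hash_string(s):
    """Treat the string as a base-2^8 number (one byte per character),
    reducing modulo M as we go so the intermediate number stays small."""
    n = 0
    for ch in s:
        n = (n * 256 + ord(ch)) % M
    return n

print(hash_int(27), hash_int(16))  # prints 5 5 -- a collision:
                                   # 27 % 11 == 16 % 11 == 5
```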
SEPARATE CHAINING

Complexity of separate chaining

The time to compute the index of a given key is a constant. Then we have to search a list for the record, so the time depends on the length of the lists. It has been shown empirically that the average list length is N/M (the load factor L), provided M is prime and we use a function that gives a good distribution. Unsuccessful searches go to the end of some list, hence we have L comparisons. Successful searches are expected to go half way down some list, so on average the number of comparisons in a successful search is L/2. Therefore we can say that the runtime complexity of separate chaining is O(L). Note that what really matters is the load factor, rather than the size of the table or the number of records taken separately.

How to choose M in separate chaining? Since the method is used in cases when we cannot predict the number of records in advance, the choice of M basically depends on other factors such as available memory. Typically M is chosen relatively small so as not to use up a large area of contiguous memory, but large enough that the lists are short for more efficient sequential search. Recommendations in the literature vary from M being about one tenth of N (the number of records) to M being equal (or close) to N.

Other methods of chaining:

- Keep the lists ordered: useful if there are many more searches than inserts, and if most of the searches are unsuccessful.
- Represent the chains as binary search trees: the extra effort needed makes this not efficient.

Advantages of separate chaining: used when memory is of concern, easily implemented.

Disadvantages: with unevenly distributed keys there can be long lists and many empty spaces in the table.
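A minimal separate-chaining table in Python, using one Python list per slot as the chain; the table size and all names are our own illustrative choices:

```python
class ChainedHashTable:
    def __init__(self, m=11):                   # m prime, per the text
        self.m = m
        self.chains = [[] for _ in range(m)]    # one chain (list) per slot

    def _hash(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.chains[self._hash(key)]
        for pair in chain:
            if pair[0] == key:
                pair[1] = value                 # key already present: update
                return
        chain.append([key, value])              # collision: extend the chain

    def search(self, key):
        for k, v in self.chains[self._hash(key)]:
            if k == key:
                return v                        # successful: ~L/2 comparisons
        return None                             # unsuccessful: L comparisons

    def delete(self, key):
        idx = self._hash(key)
        self.chains[idx] = [p for p in self.chains[idx] if p[0] != key]

t = ChainedHashTable()
t.insert(27, "a")
t.insert(16, "b")                 # 27 and 16 both hash to slot 5 when m = 11
print(t.search(27), t.search(16)) # prints: a b
```

Both colliding keys are stored and found; the cost is a short sequential search along the chain, O(L) on average.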
Open addressing

Invented by A. P. Ershov and W. W. Peterson in 1957 independently.

Idea: store collisions in the hash table itself. The method uses a collision resolution function in addition to the hash function. If a collision occurs, the next probes are performed following the formula:

    h_i(x) = (hash(x) + f(i)) mod TableSize

where hash(x) is the hash function, f(i) is the collision resolution function, and i is the number of the current attempt (probe) to insert an element.

Linear probing (linear hashing, sequential probing): f(i) = i

Insert: when there is a collision we just probe the next slot in the table. If it is unoccupied we store the key there; if it is occupied we continue probing the next slot.

Search: if the key hashes to a position that is occupied and there is no match, we probe the next position:

a) match: successful search
b) empty position: unsuccessful search
c) occupied and no match: continue probing

When the end of the table is reached, the probing continues from the beginning, until the original starting position is reached.

Problems with delete: a special flag is needed to distinguish deleted from empty positions. This is necessary for the search function: if we come to a "deleted" position,
the search has to continue, as the deletion might have been done after the insertion of the key we are looking for, and that key might be further along in the table.

[Figure: example of linear probing]

Advantage: the total amount of memory space is less, since no pointers are maintained.

Disadvantage: "primary clustering". Large clusters tend to build up: if an empty slot is preceded by i filled slots, the probability that the empty slot is the next one to be filled is (i+1)/M; if the preceding slot was empty, the probability is 1/M. This means that when the table begins to fill up, many other slots are examined. Linear probing runs slowly for nearly full tables.

Quadratic probing: f(i) = i^2

A quadratic function is used to compute the next index in the table to be probed. Example: in linear probing, if the i-th position is occupied we check the (i+1)-st position, next the (i+2)-nd, etc. In quadratic probing, if the i-th position is occupied we check the (i+1)-st, next the (i+4)-th, next the (i+9)-th, etc. The idea here is to skip over regions of the table with possible clusters.

Double hashing: f(i) = i * hash2(x)

The purpose is the same as in quadratic probing: to overcome the disadvantage of clustering. Instead of examining each successive entry following a collided position, we use a second hash function to get a fixed increment for the "probe" sequence.
The second function should be chosen so that the increment and M are relatively prime; otherwise there will be slots that remain unexamined. Example: hash2(x) = R - (x mod R), where R is a prime smaller than TableSize.

In open addressing the load factor L is less than 1. A good strategy is to keep L < 0.5.

Rehashing

If the table is close to full, the search time grows and may become equal to the table size. When the load factor exceeds a certain value (e.g. greater than 0.5) we do rehashing: build a second table twice as large as the original and rehash there all the keys of the original table. Rehashing is an expensive operation, with running time O(N). However, once done, the new hash table will have good performance.

Extendible hashing

Used when the amount of data is too large to fit in main memory and external storage is used. There are N records in total to store, with M records fitting in one disk block.

The problem: in ordinary hashing, several disk blocks may be examined to find an element, which is a time-consuming process. In extendible hashing no more than two blocks are examined.

Idea: keys are grouped according to the first m bits in their code. Each group is stored in one disk block.
If a block becomes full and no more records can be inserted, that group is split into two, and m+1 bits are considered to determine the location of a record.

Example: let's say we have 4 groups of keys according to the first two bits, and each disk block can contain 3 records only, so 4 blocks are needed to store the keys. When a new key is to be inserted into a block that is full, we start considering 3 bits for that group. The second group of keys is split onto two disk blocks: one for keys starting with 010, and one for keys starting with 011; the other groups stay on their blocks.

A directory is maintained in main memory with pointers to the disk blocks for each bit pattern. The size of the directory is

    2^D = O(N^(1 + 1/M) / M), where
    D = number of bits considered
    N = number of records
    M = number of records in a disk block

Conclusion

Hashing is the best search method (constant running time) if we don't need to have the records sorted. The choice of the hash function remains the most difficult part of the task and depends very much on the nature of the keys.

Separate chaining or open addressing? Open addressing is the preferred method if there is enough memory to keep a table twice as large as the number of records. Separate chaining is used when we don't know in advance the number of records to be stored. Though it requires additional time for list processing, it is simpler to implement.

Some application areas: dictionaries, on-line spell checkers, compiler symbol tables.
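As a closing sketch, linear probing with the "deleted" flag described in the open-addressing section might look like this in Python; the sentinel objects, class name and table size are our own choices:

```python
EMPTY, DELETED = object(), object()   # DELETED is the special flag that
                                      # distinguishes deleted from empty slots

class LinearProbingTable:
    def __init__(self, m=11):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        h = key % self.m
        for i in range(self.m):
            yield (h + i) % self.m    # f(i) = i, wrapping past the table end

    def insert(self, key):
        for idx in self._probe(key):
            if self.slots[idx] in (EMPTY, DELETED) or self.slots[idx] == key:
                self.slots[idx] = key
                return                 # table full: insertion silently fails

    def search(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:
                return False           # empty position: unsuccessful
            if self.slots[idx] == key:
                return True            # match: successful
            # occupied (or deleted) and no match: continue probing
        return False

    def delete(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:
                return
            if self.slots[idx] == key:
                self.slots[idx] = DELETED   # flag, don't blank the slot
                return

t = LinearProbingTable()
for k in (27, 16, 5):          # all three hash to index 5 when m = 11
    t.insert(k)
t.delete(16)
print(t.search(5))             # prints True: the search steps over the
                               # DELETED slot and keeps probing
```

If deletion blanked the slot instead of flagging it, the search for 5 would stop at the empty position and wrongly report an unsuccessful search.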
HO #13 Fall 2015 Gary Chan Hashing (N:12) Outline Motivation Hashing Algorithms and Improving the Hash Functions Collisions Strategies Open addressing and linear probing Separate chaining COMP2012H (Hashing)
More informationThe time and space are the two measure for efficiency of an algorithm.
There are basically six operations: 5. Sorting: Arranging the elements of list in an order (either ascending or descending). 6. Merging: combining the two list into one list. Algorithm: The time and space
More informationa) State the need of data structure. Write the operations performed using data structures.
Important Instructions to examiners: 1) The answers should be examined by key words and not as word-to-word as given in the model answer scheme. 2) The model answer and the answer written by candidate
More informationTable ADT and Sorting. Algorithm topics continuing (or reviewing?) CS 24 curriculum
Table ADT and Sorting Algorithm topics continuing (or reviewing?) CS 24 curriculum A table ADT (a.k.a. Dictionary, Map) Table public interface: // Put information in the table, and a unique key to identify
More information4. SEARCHING AND SORTING LINEAR SEARCH
4. SEARCHING AND SORTING SEARCHING Searching and sorting are fundamental operations in computer science. Searching refers to the operation of finding the location of a given item in a collection of items.
More informationSelection, Bubble, Insertion, Merge, Heap, Quick Bucket, Radix
Spring 2010 Review Topics Big O Notation Heaps Sorting Selection, Bubble, Insertion, Merge, Heap, Quick Bucket, Radix Hashtables Tree Balancing: AVL trees and DSW algorithm Graphs: Basic terminology and
More informationReview of Elementary Data. Manoj Kumar DTU, Delhi
Review of Elementary Data Manoj Kumar DTU, Delhi Structures (Part 2) Linked List: Problem Find the address/data of first common node. Use only constant amount of additional space. Your algorithm should
More informationAlgorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs
Algorithms in Systems Engineering ISE 172 Lecture 12 Dr. Ted Ralphs ISE 172 Lecture 12 1 References for Today s Lecture Required reading Chapter 5 References CLRS Chapter 11 D.E. Knuth, The Art of Computer
More informationHashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech
Hashing Dr. Ronaldo Menezes Hugo Serrano Agenda Motivation Prehash Hashing Hash Functions Collisions Separate Chaining Open Addressing Motivation Hash Table Its one of the most important data structures
More informationOperations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging.
Priority Queue, Heap and Heap Sort In this time, we will study Priority queue, heap and heap sort. Heap is a data structure, which permits one to insert elements into a set and also to find the largest
More informationCS301 - Data Structures Glossary By
CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm
More information1. Attempt any three of the following: 15
(Time: 2½ hours) Total Marks: 75 N. B.: (1) All questions are compulsory. (2) Make suitable assumptions wherever necessary and state the assumptions made. (3) Answers to the same question must be written
More informationCMSC 341 Lecture 16/17 Hashing, Parts 1 & 2
CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash
More informationCSCD 326 Data Structures I Hashing
1 CSCD 326 Data Structures I Hashing Hashing Background Goal: provide a constant time complexity method of searching for stored data The best traditional searching time complexity available is O(log2n)
More informationCSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators)
Name: Email address: Quiz Section: CSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators) Instructions: Read the directions for each question carefully before answering. We will
More informationCSE 332 Winter 2015: Midterm Exam (closed book, closed notes, no calculators)
_ UWNetID: Lecture Section: A CSE 332 Winter 2015: Midterm Exam (closed book, closed notes, no calculators) Instructions: Read the directions for each question carefully before answering. We will give
More informationOpen Addressing: Linear Probing (cont.)
Open Addressing: Linear Probing (cont.) Cons of Linear Probing () more complex insert, find, remove methods () primary clustering phenomenon items tend to cluster together in the bucket array, as clustering
More informationHashing for searching
Hashing for searching Consider searching a database of records on a given key. There are three standard techniques: Searching sequentially start at the first record and look at each record in turn until
More informationAlgorithm Efficiency & Sorting. Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms
Algorithm Efficiency & Sorting Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms Overview Writing programs to solve problem consists of a large number of decisions how to represent
More informationCSE 530A. B+ Trees. Washington University Fall 2013
CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key
More information4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING
4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING 4.1.2 ALGORITHMS ALGORITHM An Algorithm is a procedure or formula for solving a problem. It is a step-by-step set of operations to be performed. It is almost
More informationModule 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo
Module 5: Hashing CS 240 - Data Structures and Data Management Reza Dorrigiv, Daniel Roche School of Computer Science, University of Waterloo Winter 2010 Reza Dorrigiv, Daniel Roche (CS, UW) CS240 - Module
More informationCSCE 2014 Final Exam Spring Version A
CSCE 2014 Final Exam Spring 2017 Version A Student Name: Student UAID: Instructions: This is a two-hour exam. Students are allowed one 8.5 by 11 page of study notes. Calculators, cell phones and computers
More informationAdapted By Manik Hosen
Adapted By Manik Hosen Basic Terminology Question: Define Hashing. Ans: Concept of building a data structure that can be searched in O(l) time is called Hashing. Question: Define Hash Table with example.
More informationFinal Exam in Algorithms and Data Structures 1 (1DL210)
Final Exam in Algorithms and Data Structures 1 (1DL210) Department of Information Technology Uppsala University February 0th, 2012 Lecturers: Parosh Aziz Abdulla, Jonathan Cederberg and Jari Stenman Location:
More informationDirect File Organization Hakan Uraz - File Organization 1
Direct File Organization 2006 Hakan Uraz - File Organization 1 Locating Information Ways to organize a file for direct access: The key is a unique address. The key converts to a unique address. The key
More informationCOS 226 Midterm Exam, Spring 2009
NAME: login ID: precept: COS 226 Midterm Exam, Spring 2009 This test is 10 questions, weighted as indicated. The exam is closed book, except that you are allowed to use a one page cheatsheet. No calculators
More informationIntroducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved
Introducing Hashing Chapter 21 Contents What Is Hashing? Hash Functions Computing Hash Codes Compressing a Hash Code into an Index for the Hash Table A demo of hashing (after) ARRAY insert hash index =
More informationCSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015
CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions Kevin Quinn Fall 2015 Hash Tables: Review Aim for constant-time (i.e., O(1)) find, insert, and delete On average under some reasonable assumptions
More informationCSE Data Structures and Introduction to Algorithms... In Java! Instructor: Fei Wang. Mid-Term Exam. CSE2100 DS & Algorithms 1
CSE 2100 Data Structures and Introduction to Algorithms...! In Java!! Instructor: Fei Wang! Mid-Term Exam CSE2100 DS & Algorithms 1 1. True or False (20%=2%x10)! (1) O(n) is O(n^2) (2) The height h of
More informationCSIT5300: Advanced Database Systems
CSIT5300: Advanced Database Systems L08: B + -trees and Dynamic Hashing Dr. Kenneth LEUNG Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong SAR,
More informationTopic HashTable and Table ADT
Topic HashTable and Table ADT Hashing, Hash Function & Hashtable Search, Insertion & Deletion of elements based on Keys So far, By comparing keys! Linear data structures Non-linear data structures Time
More informationChapter 11: Indexing and Hashing
Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL
More informationHash Table. A hash function h maps keys of a given type into integers in a fixed interval [0,m-1]
Exercise # 8- Hash Tables Hash Tables Hash Function Uniform Hash Hash Table Direct Addressing A hash function h maps keys of a given type into integers in a fixed interval [0,m-1] 1 Pr h( key) i, where
More informationUNIT 7. SEARCH, SORT AND MERGE
UNIT 7. SEARCH, SORT AND MERGE ALGORITHMS Year 2017-2018 Industrial Technology Engineering Paula de Toledo CONTENTS 7.1. SEARCH 7.2. SORT 7.3. MERGE 2 SEARCH Search, sort and merge algorithms Search (search
More informationChapter 10. Sorting and Searching Algorithms. Fall 2017 CISC2200 Yanjun Li 1. Sorting. Given a set (container) of n elements
Chapter Sorting and Searching Algorithms Fall 2017 CISC2200 Yanjun Li 1 Sorting Given a set (container) of n elements Eg array, set of words, etc Suppose there is an order relation that can be set across
More informationSorting. Sorting in Arrays. SelectionSort. SelectionSort. Binary search works great, but how do we create a sorted array in the first place?
Sorting Binary search works great, but how do we create a sorted array in the first place? Sorting in Arrays Sorting algorithms: Selection sort: O(n 2 ) time Merge sort: O(nlog 2 (n)) time Quicksort: O(n
More informationChapter 20 Hash Tables
Chapter 20 Hash Tables Dictionary All elements have a unique key. Operations: o Insert element with a specified key. o Search for element by key. o Delete element by key. Random vs. sequential access.
More informationLecture 8 Index (B+-Tree and Hash)
CompSci 516 Data Intensive Computing Systems Lecture 8 Index (B+-Tree and Hash) Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 HW1 due tomorrow: Announcements Due on 09/21 (Thurs),
More informationHash-Based Indexing 1
Hash-Based Indexing 1 Tree Indexing Summary Static and dynamic data structures ISAM and B+ trees Speed up both range and equality searches B+ trees very widely used in practice ISAM trees can be useful
More information) $ f ( n) " %( g( n)
CSE 0 Name Test Spring 008 Last Digits of Mav ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. The time to compute the sum of the n elements of an integer array is: # A.
More informationSection 05: Solutions
Section 05: Solutions 1. Memory and B-Tree (a) Based on your understanding of how computers access and store memory, why might it be faster to access all the elements of an array-based queue than to access
More informationData Structures and Algorithms
Data Structures and Algorithms CS245-2008S-19 B-Trees David Galles Department of Computer Science University of San Francisco 19-0: Indexing Operations: Add an element Remove an element Find an element,
More informationCSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)
CSE100 Advanced Data Structures Lecture 21 (Based on Paul Kube course materials) CSE 100 Collision resolution strategies: linear probing, double hashing, random hashing, separate chaining Hash table cost
More informationImplementation with ruby features. Sorting, Searching and Haching. Quick Sort. Algorithm of Quick Sort
Implementation with ruby features Sorting, and Haching Bruno MARTI, University of ice - Sophia Antipolis mailto:bruno.martin@unice.fr http://deptinfo.unice.fr/ hmods.html It uses the ideas of the quicksort
More informationChapter 11: Indexing and Hashing
Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree
More informationCSE 332, Spring 2010, Midterm Examination 30 April 2010
CSE 332, Spring 2010, Midterm Examination 30 April 2010 Please do not turn the page until the bell rings. Rules: The exam is closed-book, closed-note. You may use a calculator for basic arithmetic only.
More informationIn-Memory Searching. Linear Search. Binary Search. Binary Search Tree. k-d Tree. Hashing. Hash Collisions. Collision Strategies.
In-Memory Searching Linear Search Binary Search Binary Search Tree k-d Tree Hashing Hash Collisions Collision Strategies Chapter 4 Searching A second fundamental operation in Computer Science We review
More informationData Structure for Language Processing. Bhargavi H. Goswami Assistant Professor Sunshine Group of Institutions
Data Structure for Language Processing Bhargavi H. Goswami Assistant Professor Sunshine Group of Institutions INTRODUCTION: Which operation is frequently used by a Language Processor? Ans: Search. This
More informationTables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries
1: Tables Tables The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries Symbol Tables Associative Arrays (eg in awk,
More informationRecitation 9. Prelim Review
Recitation 9 Prelim Review 1 Heaps 2 Review: Binary heap min heap 1 2 99 4 3 PriorityQueue Maintains max or min of collection (no duplicates) Follows heap order invariant at every level Always balanced!
More informationComputer Science 136 Spring 2004 Professor Bruce. Final Examination May 19, 2004
Computer Science 136 Spring 2004 Professor Bruce Final Examination May 19, 2004 Question Points Score 1 10 2 8 3 15 4 12 5 12 6 8 7 10 TOTAL 65 Your name (Please print) I have neither given nor received
More information2-3 Tree. Outline B-TREE. catch(...){ printf( "Assignment::SolveProblem() AAAA!"); } ADD SLIDES ON DISJOINT SETS
Outline catch(...){ printf( "Assignment::SolveProblem() AAAA!"); } Balanced Search Trees 2-3 Trees 2-3-4 Trees Slide 4 Why care about advanced implementations? Same entries, different insertion sequence:
More informationCS 137 Part 8. Merge Sort, Quick Sort, Binary Search. November 20th, 2017
CS 137 Part 8 Merge Sort, Quick Sort, Binary Search November 20th, 2017 This Week We re going to see two more complicated sorting algorithms that will be our first introduction to O(n log n) sorting algorithms.
More informationCS-301 Data Structure. Tariq Hanif
1. The tree data structure is a Linear data structure Non-linear data structure Graphical data structure Data structure like queue FINALTERM EXAMINATION Spring 2012 CS301- Data Structure 25-07-2012 2.
More informationLecture 5. Treaps Find, insert, delete, split, and join in treaps Randomized search trees Randomized search tree time costs
Lecture 5 Treaps Find, insert, delete, split, and join in treaps Randomized search trees Randomized search tree time costs Reading: Randomized Search Trees by Aragon & Seidel, Algorithmica 1996, http://sims.berkeley.edu/~aragon/pubs/rst96.pdf;
More informationHash Tables. CS 311 Data Structures and Algorithms Lecture Slides. Wednesday, April 22, Glenn G. Chappell
Hash Tables CS 311 Data Structures and Algorithms Lecture Slides Wednesday, April 22, 2009 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks CHAPPELLG@member.ams.org 2005
More informationTHINGS WE DID LAST TIME IN SECTION
MA/CSSE 473 Day 24 Student questions Space-time tradeoffs Hash tables review String search algorithms intro We did not get to them in other sections THINGS WE DID LAST TIME IN SECTION 1 1 Horner's Rule
More informationCS 350 Algorithms and Complexity
CS 350 Algorithms and Complexity Winter 2019 Lecture 12: Space & Time Tradeoffs. Part 2: Hashing & B-Trees Andrew P. Black Department of Computer Science Portland State University Space-for-time tradeoffs
More informationChapter 12: Indexing and Hashing. Basic Concepts
Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition
More informationUnit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION
DESIGN AND ANALYSIS OF ALGORITHMS Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION http://milanvachhani.blogspot.in EXAMPLES FROM THE SORTING WORLD Sorting provides a good set of examples for analyzing
More informationCS:3330 (22c:31) Algorithms
What s an Algorithm? CS:3330 (22c:31) Algorithms Introduction Computer Science is about problem solving using computers. Software is a solution to some problems. Algorithm is a design inside a software.
More information
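The binary search technique introduced above repeatedly narrows the search to a segment DATA[BEG], ..., DATA[END] of the sorted array. The following is a minimal Python sketch of that idea; the function name and the convention of returning -1 for an unsuccessful search are illustrative assumptions, not part of the original notes.

```python
def binary_search(data, item):
    """Return the location of item in the sorted list data, or -1 if absent."""
    beg, end = 0, len(data) - 1
    while beg <= end:
        mid = (beg + end) // 2      # middle of the current segment
        if data[mid] == item:
            return mid              # successful search: ITEM found at LOC = mid
        elif data[mid] < item:
            beg = mid + 1           # ITEM can only lie in the upper half
        else:
            end = mid - 1           # ITEM can only lie in the lower half
    return -1                       # unsuccessful search

# Example on a sorted array:
# binary_search([5, 12, 35, 42, 77, 101], 42) -> 3
```

Each iteration halves the segment under consideration, which is why binary search runs in O(log n) time, compared with the O(n) time of the sequential search given earlier.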