CS 2412 Data Structures Chapter 10 Sorting and Searching

Some concepts
Sorting is one of the most common data-processing applications.
Sorting algorithms are classified as either internal or external.
Sort order can be either ascending or descending.
Sort stability is an attribute of a sort indicating that data with equal keys maintain their relative input order in the output.
Sort efficiency is usually measured by the number of comparisons and moves the sort requires.
The best possible comparison-based sorting algorithms are O(n log n).
During the sorting process, each traversal of the data is referred to as a sort pass.

Selection sorts
Heap sort: already discussed. First build a heap. Then repeatedly remove the root of the heap, move the last element to the root, and reheap down.
Straight selection sort: in each pass, the smallest element is selected from the unsorted sublist and exchanged with the element at the beginning of the unsorted sublist.

Algorithm selectionsort (list, last)
    set current to 0
    loop (until last element sorted)
        set smallest to current
        set walker to current + 1
        loop (walker <= last)
            if (walker key < smallest key)
                set smallest to walker
            end if
            increment walker
        end loop
        exchange (current, smallest)
        increment current
    end loop
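The selection sort pseudocode can be sketched in C as follows. This is a minimal illustration, assuming integer keys and that last is the index of the final element (the same convention as the other C functions in these slides); the name selection_sort is chosen here for illustration only.

```c
/* Straight selection sort on list[0..last].
   Assumption: keys are plain ints and last is the index of the
   final element, not the element count. */
void selection_sort(int list[], int last)
{
    for (int current = 0; current < last; current++) {
        int smallest = current;

        /* Find the smallest key in the unsorted sublist. */
        for (int walker = current + 1; walker <= last; walker++) {
            if (list[walker] < list[smallest])
                smallest = walker;
        }

        /* Exchange it with the first element of the unsorted sublist. */
        int temp = list[current];
        list[current] = list[smallest];
        list[smallest] = temp;
    }
}
```

For a six-element array, the call would be selection_sort(list, 5).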

The efficiency of selection sorts
Straight selection sort: O(n^2). The algorithm has two nested loops, and each loop executes about n times.
Heap sort: O(n log n). Building the heap takes about n log n loop iterations, and sorting from the heap takes about another n log n. In big-O notation, the complexity is O(n log n).
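The heap sort steps described for selection sorts (build a heap, then repeatedly remove the root and reheap down) might look like the sketch below. It assumes an array-based max-heap of ints in which the children of index i are at 2i + 1 and 2i + 2. The names reheap_down and heap_sort are illustrative, and the heap is built here by reheaping down from the last parent, which is one of several ways to build it.

```c
/* Restore the max-heap property for the subtree rooted at root,
   looking only at list[0..last]. */
static void reheap_down(int list[], int root, int last)
{
    for (;;) {
        int child = 2 * root + 1;              /* left child */
        if (child > last)
            break;                             /* root is a leaf */
        if (child < last && list[child + 1] > list[child])
            child++;                           /* right child is larger */
        if (list[root] >= list[child])
            break;                             /* heap property holds */
        int temp = list[root];
        list[root] = list[child];
        list[child] = temp;
        root = child;
    }
}

/* Heap sort on list[0..last] (last is the index of the final element). */
void heap_sort(int list[], int last)
{
    /* Build the heap, starting from the last parent node. */
    for (int i = (last - 1) / 2; i >= 0; i--)
        reheap_down(list, i, last);

    /* Move the root (largest key) to the end of the heap, shrink the
       heap by one element, and reheap down from the new root. */
    for (int end = last; end > 0; end--) {
        int temp = list[0];
        list[0] = list[end];
        list[end] = temp;
        reheap_down(list, 0, end - 1);
    }
}
```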

Insertion sorts
Straight insertion sort: the list is divided into sorted and unsorted sublists. In each pass, the first element of the unsorted sublist is inserted into the sorted sublist at its correct position.
Shell sort: the list is divided into K segments and each segment is sorted (the segments are dispersed through the list). After each pass, the number of segments is reduced according to an increment. After the pass in which the number of segments has been reduced to 1, the list is sorted.

Algorithm insertionsort (list, last)
    set current to 1
    loop (until last element sorted)
        move current element to hold
        set walker to current - 1
        loop (walker >= 0 AND hold key < walker key)
            move walker element right one element
            decrement walker
        end loop
        move hold to walker + 1 element
        increment current
    end loop
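A direct C rendering of the insertion sort pseudocode is sketched below, assuming integer keys and last as the index of the final element. The function name insertion_sort is illustrative.

```c
/* Straight insertion sort on list[0..last]. */
void insertion_sort(int list[], int last)
{
    for (int current = 1; current <= last; current++) {
        int hold = list[current];      /* first element of the unsorted sublist */
        int walker = current - 1;

        /* Shift larger elements of the sorted sublist one place right. */
        while (walker >= 0 && hold < list[walker]) {
            list[walker + 1] = list[walker];
            walker--;
        }
        list[walker + 1] = hold;       /* insert at its correct position */
    }
}
```

Because an element is moved only when hold is strictly less than it, elements with equal keys keep their relative order, so this version is stable.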

The main idea of the Shell sort is to divide the list into segments and use insertion sort to sort each segment. The elements of a segment are spaced one increment apart. In the following example, the list has size 10.

The 5 segments for increment K = 5:
    Segment 1: A[0], A[5]
    Segment 2: A[1], A[6]
    Segment 3: A[2], A[7]
    Segment 4: A[3], A[8]
    Segment 5: A[4], A[9]

The 2 segments for increment K = 2:
    Segment 1: A[0], A[2], A[4], A[6], A[8]
    Segment 2: A[1], A[3], A[5], A[7], A[9]

Algorithm shellsort (list, last)
    set incre to last / 2
    loop (incre not 0)
        set current to incre
        loop (until last element sorted)
            move current element to hold
            set walker to current - incre
            loop (walker >= 0 AND hold key < walker key)
                move walker element one increment right
                set walker to walker - incre
            end loop
            move hold to walker + incre element
            increment current
        end loop
        set incre to incre / 2
    end loop

    void shellsort (int list [], int last)
    {
        int hold;
        int incre;
        int walker;

        incre = last / 2;
        while (incre != 0) {
            for (int curr = incre; curr <= last; curr++) {
                hold = list [curr];
                walker = curr - incre;
                while (walker >= 0 && hold < list [walker]) {
                    // move the larger element one increment right
                    list [walker + incre] = list [walker];
                    walker = walker - incre;
                }   // while
                list [walker + incre] = hold;
            }   // for curr
            incre = incre / 2;
        }   // while
        return;
    }   // shellsort

Note
In the above algorithm, the increment starts at about n/2 (last / 2 in the code) and is halved after each pass. This is not the most efficient choice, but it is simple. Ideally, the increments should be chosen so that no two elements appear in the same segment more than once, but this is not easy to achieve in general.

Insertion sort efficiency
Straight insertion sort: O(n^2). The algorithm has two nested loops, and the inner loop executes about n(n + 1)/2 times in total.
Shell sort: the complexity is difficult to analyze. Empirical studies show that the average sort complexity is about O(n^1.25).

Exchange sorts
Bubble sort: the list is divided into two sublists, sorted and unsorted. In each pass, the smallest element is bubbled from the unsorted sublist to the sorted sublist.
Quick sort: in each pass, a pivot is selected. The elements less than the pivot and the elements greater than or equal to the pivot are then separated into two sublists, and the pivot is put at its ultimately correct location in the list.

Algorithm bubblesort (list, last)
    set current to 0
    set sorted to false
    loop (current <= last AND sorted false)
        set walker to last
        set sorted to true
        loop (walker > current)
            if (walker data < walker - 1 data)
                set sorted to false
                exchange (list, walker, walker - 1)
            end if
            decrement walker
        end loop
        increment current
    end loop

Note for quick sort
There are different methods for selecting the pivot:
Select the first element.
Select the middle element.
Select the median value of three elements: the left, the right, and the element in the middle of the list. This text uses this method.
When a partition becomes small, a straight insertion sort can be used instead, which may be more efficient.

Example for one pass of a quick sort: [figure not reproduced]
Data Structure 2016, R. Wei

Algorithm medianleft (sortdata, left, right)
    set mid to (left + right) / 2
    if (left key > mid key)
        exchange (sortdata, left, mid)
    end if
    if (left key > right key)
        exchange (sortdata, left, right)
    end if
    if (mid key > right key)
        exchange (sortdata, mid, right)
    end if
    exchange (sortdata, left, mid)    // put pivot in left
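To show how medianleft fits into a complete sort, here is a hedged C sketch. median_left follows the pseudocode above; the partition loop and the recursive quick_sort are one possible arrangement written for this illustration, not necessarily the version used in the text. In particular, this sketch omits the switch to insertion sort for small partitions. All function names are illustrative.

```c
static void swap_ints(int *a, int *b)
{
    int temp = *a;
    *a = *b;
    *b = temp;
}

/* Order left, mid, right so the median is at mid, then move it to left. */
static void median_left(int data[], int left, int right)
{
    int mid = (left + right) / 2;
    if (data[left] > data[mid])   swap_ints(&data[left], &data[mid]);
    if (data[left] > data[right]) swap_ints(&data[left], &data[right]);
    if (data[mid]  > data[right]) swap_ints(&data[mid],  &data[right]);
    swap_ints(&data[left], &data[mid]);        /* put pivot in left */
}

/* Quick sort on data[left..right] (both indexes inclusive). */
void quick_sort(int data[], int left, int right)
{
    if (left >= right)
        return;                                /* 0 or 1 element */

    median_left(data, left, right);
    int pivot = data[left];
    int sort_left = left + 1;
    int sort_right = right;

    while (sort_left <= sort_right) {
        /* Skip keys already on the correct side of the pivot. */
        while (sort_left <= sort_right && data[sort_left] < pivot)
            sort_left++;
        while (sort_left <= sort_right && data[sort_right] >= pivot)
            sort_right--;
        if (sort_left < sort_right) {
            swap_ints(&data[sort_left], &data[sort_right]);
            sort_left++;
            sort_right--;
        }
    }

    /* sort_right is now the last key less than the pivot (or left itself),
       so the pivot belongs there. */
    swap_ints(&data[left], &data[sort_right]);

    quick_sort(data, left, sort_right - 1);    /* keys less than pivot */
    quick_sort(data, sort_right + 1, right);   /* keys >= pivot */
}
```

For an array of n elements the initial call would be quick_sort(data, 0, n - 1).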

The list in Figure is sorted as follows: [figure not reproduced]

The exchange sort efficiency
Bubble sort: O(n^2). There are two nested loops in the algorithm, and the number of comparisons is about n(n + 1)/2.
Quick sort: O(n log n) on average. The algorithm has 5 loops. However, on each pass the partitions are generally about half the size of those in the previous pass, so there are roughly log_2 n passes in total, each processing about n elements.

    void bubblesort (int list [], int last)
    {
        int temp;
        int sorted = 0;    // becomes 1 when a pass makes no exchange

        for (int current = 0; current <= last && !sorted; current++) {
            sorted = 1;    // assume sorted until an exchange occurs
            for (int walker = last; walker > current; walker--)
                if (list[walker] < list[walker - 1]) {
                    sorted = 0;
                    temp = list[walker];
                    list[walker] = list[walker - 1];
                    list[walker - 1] = temp;
                }   // if
        }   // for current
        return;
    }   // bubblesort

External sorts
In external sorting, portions of the data may be stored in secondary memory during the sorting process. One important method for external sorting is to merge several sorted files into one sorted file.

Merge sorts
A simple merge combines two sorted files into one sorted file. For example, suppose we have two sorted lists:
    1, 3, 5
    2, 4, 6, 8, 10
After merging these two lists, we should obtain the following list:
    1, 2, 3, 4, 5, 6, 8, 10

The following algorithm merges two sorted files, file1 and file2. The combined data are written to file3.

Algorithm mergefiles
    open files
    read (file1 into record1)
    read (file2 into record2)
    loop (not end file1 or not end file2)
        if (record1.key <= record2.key)
            write (record1 to file3)
            read (file1 into record1)
            if (end of file1)
                set record1.key to infinity
            end if
        else
            write (record2 to file3)
            read (file2 into record2)
            if (end of file2)
                set record2.key to infinity
            end if
        end if
    end loop
    close files
end mergefiles
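The same merge logic can be tried in memory. The sketch below is an analogue of mergefiles that works on two sorted int arrays instead of files; it is an illustration only, and it replaces the "set key to infinity" sentinel with explicit index bounds. The name merge_sorted and its parameters are assumptions made for this example.

```c
#include <stddef.h>

/* Merge sorted arrays a[0..na-1] and b[0..nb-1] into out, which must
   have room for na + nb elements. Returns the number of elements written. */
size_t merge_sorted(const int a[], size_t na,
                    const int b[], size_t nb, int out[])
{
    size_t i = 0, j = 0, k = 0;

    /* While both inputs have data, copy the smaller key. Taking from a
       on ties keeps equal keys in input order. */
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];

    /* One input is exhausted: copy the rest of the other. */
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];

    return k;
}
```

Merging 1, 3, 5 with 2, 4, 6, 8, 10 this way produces the eight values 1, 2, 3, 4, 5, 6, 8, 10.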

Merge unsorted files
Form merge runs in the files. Each run is ordered. The end of each run is identified by a stepdown (a key that is smaller than the key before it).
Merge the corresponding runs of the two files. When one run reaches its stepdown, the rest of the other run is rolled out (copied to the merged file).
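As a small illustration of how a stepdown marks the end of a merge run, the sketch below counts the runs in an int array. It assumes a stepdown means a key smaller than the key immediately before it; the function name count_runs is hypothetical.

```c
#include <stddef.h>

/* Count the merge runs in data[0..n-1]. A new run starts at every
   stepdown, i.e. wherever a key is smaller than its predecessor. */
size_t count_runs(const int data[], size_t n)
{
    if (n == 0)
        return 0;

    size_t runs = 1;                   /* the first key starts a run */
    for (size_t i = 1; i < n; i++) {
        if (data[i] < data[i - 1])     /* stepdown: previous run ended */
            runs++;
    }
    return runs;
}
```

For the keys 14, 20, 36, 10, 11, 40, 50, 3, 9 there are stepdowns at 10 and at 3, so the data form three runs.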

The sorting process
Sort phase: divide the file into merge files according to the size of memory. For example, suppose we have 2300 records but memory can hold only 500 records. We first read in 500 records and sort them as the first run of merge file 1. Then we read and sort the next group of records as the first run of merge file 2, and so on.
Merge phase: merge the sorted runs.

There are different merge concepts. We discuss three of them as examples.
Natural merge: after each merge, all data are written to one file, and a distribution phase is needed to redistribute the data to two files.
Balanced merge: uses a constant number of input merge files and the same number of output merge files.
Polyphase merge: a constant number of input merge files are merged to one output merge file, and each input merge file is immediately reused once its input has been completely merged.

Searching
Binary search: for a sorted list.
Sequential search:
Straight sequential search: at each step, check whether the key equals the target AND whether it is the last key.
Sentinel sequential search: add the target at the end of the list, so that each step only has to check whether the key equals the target.
Probability search: when a target is found, move the element containing it up one location. In this way, the most frequent targets become easier to find.
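Two of these searches are sketched below in C, assuming int keys and last as the index of the final element. binary_search requires a sorted list. sentinel_search additionally assumes the array has one spare slot at list[last + 1], which it overwrites with the target. Both function names are illustrative.

```c
/* Binary search on sorted list[0..last].
   Returns the index of target, or -1 if it is not present. */
int binary_search(const int list[], int last, int target)
{
    int first = 0;
    while (first <= last) {
        int mid = first + (last - first) / 2;
        if (target > list[mid])
            first = mid + 1;           /* look in the upper half */
        else if (target < list[mid])
            last = mid - 1;            /* look in the lower half */
        else
            return mid;
    }
    return -1;
}

/* Sentinel sequential search on list[0..last].
   Assumption: list[last + 1] exists and may be overwritten.
   Returns the index of target, or -1 if only the sentinel matched. */
int sentinel_search(int list[], int last, int target)
{
    list[last + 1] = target;           /* sentinel guarantees a match */
    int looker = 0;
    while (list[looker] != target)     /* one test per element */
        looker++;
    return (looker <= last) ? looker : -1;
}
```

The sentinel version trades one extra array slot for a loop that tests only the key, not the end of the list.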

Hashed list searches
Hashing is a method that uses a key-to-address mapping to find data quickly. The basic idea is to use a hash function to map a key (drawn from a large range) to an index (in a small range) into the data. Several keys may be mapped to the same index (synonyms), so we need some method to resolve the collision. The main task in hashing is to find good hashing methods.

Hashing methods:
Direct method: the range of keys and the range of indexes are the same.
Subtraction method: subtract a fixed number from the key. This also requires that both ranges have the same size.
Modulo-division method: index = key MODULO listsize.
Digit-extraction method: select the digits at certain positions of the key as the index.
Midsquare method: the key is squared and the middle digits of the result are used as the index.
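Three of these methods are small enough to sketch directly. The code below is an illustration under stated assumptions: keys are unsigned integers, the subtraction method is given the lowest key as its fixed number, and the midsquare function is specialized to 4-digit keys with a 4-digit index. All function names are hypothetical.

```c
/* Subtraction method: subtract a fixed number (here the lowest key). */
unsigned hash_subtraction(unsigned long key, unsigned long lowest_key)
{
    return (unsigned)(key - lowest_key);
}

/* Modulo-division method: index = key MODULO listsize. */
unsigned hash_modulo(unsigned long key, unsigned list_size)
{
    return (unsigned)(key % list_size);
}

/* Midsquare method, assuming a key of at most 4 digits: the square has
   at most 8 digits, and the middle 4 digits are used as the index. */
unsigned hash_midsquare4(unsigned key)
{
    unsigned long square = (unsigned long)key * key;
    return (unsigned)((square / 100) % 10000);   /* drop 2 low digits, keep 4 */
}
```

For example, with a list size of 307 the key 121267 maps to index 2, and the key 9452 squares to 89340304, whose middle four digits give 3403.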

Folding method:
Fold shift: the key is divided into parts whose size matches the size of the index. The left and right parts are then shifted and added to the middle part.
Fold boundary: the left and right parts are folded on a fixed boundary between them and the center part, so the two outside values are reversed before the parts are added.
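The two folding variants can be illustrated with a fixed-size example. This sketch assumes a 9-digit key split into three 3-digit parts and a 3-digit index, and it assumes that any carry beyond three digits in the sum is discarded. The function names are hypothetical.

```c
/* Reverse the digits of a 3-digit part, e.g. 123 -> 321. */
static unsigned reverse3(unsigned part)
{
    return (part % 10) * 100 + ((part / 10) % 10) * 10 + part / 100;
}

/* Fold shift: add the three parts as they are. */
unsigned fold_shift(unsigned long key)
{
    unsigned left   = (unsigned)((key / 1000000UL) % 1000);
    unsigned middle = (unsigned)((key / 1000UL) % 1000);
    unsigned right  = (unsigned)(key % 1000);
    return (left + middle + right) % 1000;       /* discard the carry digit */
}

/* Fold boundary: reverse the two outside parts, then add. */
unsigned fold_boundary(unsigned long key)
{
    unsigned left   = (unsigned)((key / 1000000UL) % 1000);
    unsigned middle = (unsigned)((key / 1000UL) % 1000);
    unsigned right  = (unsigned)(key % 1000);
    return (reverse3(left) + middle + reverse3(right)) % 1000;
}
```

For the key 123456789, fold shift gives 123 + 456 + 789 = 1368, so the index is 368; fold boundary gives 321 + 456 + 987 = 1764, so the index is 764.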

Rotation method: rotate the last character of the key to the front. This method is usually used in combination with other methods.
Pseudorandom method: the key is used as the seed of a pseudorandom number generator, and the resulting random number is then scaled into the possible index range.

Some concepts used in collision resolution methods:
Load factor: the number of elements in the list divided by the number of physical elements allocated for the list, expressed as a percentage (preferably less than 75):
    α = (k / n) × 100
where k is the number of filled elements and n is the number of elements allocated.
Clustering: as data are added to a list and collisions are resolved, some hashing algorithms tend to cause data to group within the list.

Open addressing to resolve collisions (disadvantage: each collision resolution increases the probability of future collisions).
Linear probe: when data cannot be stored at the home address, we resolve the collision by adding 1 to the current address.
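A minimal sketch of open addressing with a linear probe follows. It assumes non-negative int keys, modulo-division as the hash function, a small fixed table size, and -1 as the marker for an empty slot; deletion is not handled. TABLE_SIZE, EMPTY, and the function names are all chosen for this illustration.

```c
#define TABLE_SIZE 11      /* assumed list size for the example */
#define EMPTY      (-1)    /* assumed marker for an unused slot */

/* Mark every slot as unused. */
void table_init(int table[])
{
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Insert key, probing linearly from its home address.
   Returns the address used, or -1 if the table is full. */
int linear_insert(int table[], int key)
{
    int home = key % TABLE_SIZE;
    for (int probe = 0; probe < TABLE_SIZE; probe++) {
        int addr = (home + probe) % TABLE_SIZE;   /* add 1 per collision, wrap */
        if (table[addr] == EMPTY) {
            table[addr] = key;
            return addr;
        }
    }
    return -1;
}

/* Search for key along the same probe sequence.
   Returns its address, or -1 if it is not in the table. */
int linear_search(const int table[], int key)
{
    int home = key % TABLE_SIZE;
    for (int probe = 0; probe < TABLE_SIZE; probe++) {
        int addr = (home + probe) % TABLE_SIZE;
        if (table[addr] == key)
            return addr;
        if (table[addr] == EMPTY)
            return -1;                            /* an empty slot ends the search */
    }
    return -1;
}
```

With this table size, the keys 22 and 33 share home address 0, so 33 lands at address 1; the key 12, whose home address is 1, is then pushed to address 2. This is a small instance of the clustering that linear probing tends to produce.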

Quadratic probe: the increment is the collision probe number squared.

Pseudorandom collision resolution (double hashing): use a pseudorandom number to resolve the collision. The collision address is used as the key of the pseudorandom generator.

Key offset (double hashing): calculate the new address as a function of the old address and the key. For example:
    offset = key / listsize
    address = (offset + old address) MODULO listsize
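The key offset formula can be checked with a short function. This is a sketch assuming non-negative integer keys and integer division for key / listsize; the function name key_offset_address is hypothetical.

```c
/* Key offset collision resolution:
     offset  = key / listsize            (integer division)
     address = (offset + old address) MODULO listsize */
int key_offset_address(long key, int old_address, int list_size)
{
    long offset = key / list_size;
    return (int)((offset + old_address) % list_size);
}
```

As a worked example with a list size of 307: the key 166702 has home address 166702 MODULO 307 = 1 and offset 166702 / 307 = 543. After a collision at address 1, the next address is (543 + 1) MODULO 307 = 237; after a second collision it is (543 + 237) MODULO 307 = 166.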

Linked list collision resolution: use a separate area to store collisions and chain all synonyms together in a linked list (usually in LIFO sequence). Two storage areas are used: the prime area and the overflow area.
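Chaining synonyms in LIFO order can be sketched as below. This is a simplification rather than the exact layout described above: the prime area here holds only head pointers, and every key is stored in a dynamically allocated overflow node. It assumes non-negative int keys and modulo-division hashing; PRIME_SIZE, struct node, and the function names are illustrative.

```c
#include <stdlib.h>

#define PRIME_SIZE 7       /* assumed size of the prime area */

struct node {
    int key;
    struct node *next;     /* next synonym in the chain */
};

/* Insert key at the head of its chain (LIFO sequence).
   Returns 0 on success, -1 if memory allocation fails. */
int chain_insert(struct node *table[], int key)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;

    int home = key % PRIME_SIZE;
    n->key = key;
    n->next = table[home];     /* the new node points at the old head */
    table[home] = n;
    return 0;
}

/* Walk the chain at the key's home address.
   Returns the node holding key, or NULL if it is not present. */
const struct node *chain_search(struct node *const table[], int key)
{
    for (const struct node *p = table[key % PRIME_SIZE]; p != NULL; p = p->next) {
        if (p->key == key)
            return p;
    }
    return NULL;
}
```

The table should start with every head pointer set to NULL, for example struct node *table[PRIME_SIZE] = {0};. A complete program would also free the nodes when the table is no longer needed.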

Bucket hashing: keys are hashed to buckets, which are nodes that accommodate multiple data occurrences. (Disadvantages: buckets use more space, much of it empty, and a collision still occurs when a bucket is full.)

Combination approaches may be used: for example, bucket hashing first, then a linear probe if the bucket is full.
