Advanced Algorithms and Data Structures Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Prerequisites A seven credit unit course Replaced OHJ-2156 Analysis of Algorithms We take things a bit further than OHJ-2156 We will assume familiarity with Necessary mathematics Elementary data structures Programming MAT-72006 AADS, Fall 2015 27-Aug-15 2 1
Course Basics There will be 4 hours of lectures per week Weekly exercises start in two weeks time We will not have a programming exercise this year (unless you demand to have one) We might consider organizing a seminar with voluntary presentations (yielding extra points) at the end of the course MAT-72006 AADS, Fall 2015 27-Aug-15 3 Organization & Timetable Lectures: Prof. Tapio Elomaa Tue & Thu 2 4 PM in TB223 (& TB219 in PII) Aug. 25 Dec. 3, 2015 Period break Oct. 12 18, 2015 Exercises: M.Sc. Juho Lauri & M.Sc. Teemu Heinimäki, Mon 10 12 TC315, Start: Sept. 9 Exam: Fri Dec. 18, 2015 MAT-72006 AADS, Fall 2015 27-Aug-15 4 2
Guest Lecture Dr. Timo Aho from Solita In October Big data Definitions of data analytics Methods Case studies MAT-72006 AADS, Fall 2015 27-Aug-15 5 Course Grading Exam: Maximum of 30 points Weekly exercises yield extra points 40% of questions answered: 1 point 80% answered: 6 points In between: linear scale (so that decimals are possible) Final grading depends on what we agree as course components MAT-72006 AADS, Fall 2015 27-Aug-15 6 3
Material The textbook of the course is Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms, 3rd ed., MIT Press, 2009 There is no prepared material, the slides appear in the web as the lectures proceed http://www.cs.tut.fi/~elomaa/teach/72006.html The exam is based on the lectures (i.e., not on the slides only) MAT-72006 AADS, Fall 2015 27-Aug-15 7 Content (Plan) I Foundations II (Sorting and) Order Statistics III Data Structures IV Advanced Design and Analysis Techniques V Advanced Data Structures VI Graph Algorithms VII Selected Topics MAT-72006 AADS, Fall 2015 27-Aug-15 8 4
I Foundations The Role of Algorithms in Computing Getting Started Growth of Functions Recurrences Probabilistic Analysis and Randomized Algorithms II (Sorting and) Order Statistics Heapsort Quicksort Sorting in Linear Time Medians and Order Statistics 5
III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees IV Advanced Design and Analysis Techniques Dynamic Programming Greedy Algorithms Amortized Analysis 6
V Advanced Data Structures B-Trees Binomial Heaps Fibonacci Heaps VI Graph Algorithms MAT-62756 GRAPH THEORY Elementary Graph Algorithms Minimum Spanning Trees Single-Source Shortest Paths All-Pairs Shortest Paths Maximum Flow 7
VII Selected Topics Matrix Operations Linear Programming Number-Theoretic Algorithms Approximation Algorithms Part of these could also be student presentation topics Online ChiMerge In machine learning and data mining algorithms one often needs to discretize numerical attributes There are initial intervals that cannot be divided (examples that have the same value for the attribute in question) CHIMERGE is intended as a preprocessing technique for learning algorithms It combines adjacent initial intervals bottom-up to come up with the final discretization of the value range of the numerical attribute in question MAT-72006 AADS, Fall 2015 27-Aug-15 16 8
Combine? Combine? Combine? Combine? Combine? Combine? Combine? Combine? Combine? MAT-72006 AADS, Fall 2015 27-Aug-15 17 The worst-case time complexity of the batch algorithm is lg ) The main requirement for the online version of CHIMERGE (OCM) is a processing time that is, per each received training example, logarithmic in the number of intervals Thus, we altogether match the lg ) time requirement of the batch version of CHIMERGE MAT-72006 AADS, Fall 2015 27-Aug-15 18 9
If each new example drawn from DATASOURCE would update the current discretization immediately, this would yield a slowing down of OCM by a factor as compared to the batch version Therefore, we update the active discretization only periodically, freezing it for the intermediate period Examples received during this time can be handled at a logarithmic time The price to be paid is maintaining some other data structures in addition to the BST MAT-72006 AADS, Fall 2015 27-Aug-15 19 OCM uses a balanced binary search tree into which each example drawn from DATASOURCE is (eventually) submitted The algorithm continuously takes in new examples from DATASOURCE, but instead of just storing them to, it updates the required data structures simultaneously We do not halt the ow of the data stream because of extra processing of the data structures, but amortize the required work to normal example handling BST plus a queue, two doubly linked lists, and a priority queue (heap) are maintained MAT-72006 AADS, Fall 2015 27-Aug-15 20 10
MAT-72006 AADS, Fall 2015 27-Aug-15 21 The discretization process works in three phases In Phase 1 (of length time steps), the examples from DATASOURCE are cached for temporary storage to the queue Meanwhile the initial intervals are collected one at a time from to a list that holds the intervals in their numerical order For each pair of two adjacent intervals, the value is computed and inserted into the priority queue The queue is needed to freeze for the duration of the Phase 1 MAT-72006 AADS, Fall 2015 27-Aug-15 22 11
Procedure OCM ) Input: Integer giving the number of examples on which initial discretization is based on and real which is the threshold for interval merging Data structures: A BST, a queue, two doubly linked lists and, and a priority queue 1. Initialize,,, and as empty; 2. Read training examples into tree ; 3. phase 1; TREEMINIM UM); 4. Make an empty priority queue of size 1; 5. while DATASOURCE() nil do MAT-72006 AADS, Fall 2015 27-Aug-15 23 6. if phase = 1 then ENQUEUE(,) 7. else TREEINSERT fi; 8. if phase = 1 then ---------------------- 9. Insert a copy of as the last item in ; 10. if >1then 11. Compute the -value of the two last intervals in ; 12. Using the obtained value as a key, add a pointer to the two intervals in into ; 13. fi 14. TREESUCCESSOR ); 15. if = nil then phase 2 fi MAT-72006 AADS, Fall 2015 27-Aug-15 24 12
After seeing examples, all the initial intervals in have been inserted to and all the values have been computed and added to At this point holds (unprocessed) examples At the end of Phase 1 all required preparations for interval merging have been done Phase 2 implements the merging as in batch CHIMERGE, but without stopping the processing of DATASOURCE As a result of each merging, the intervals in as well as the related -values in are updated MAT-72006 AADS, Fall 2015 27-Aug-15 25 In Phase 2 the example received from DATASOURCE is submitted directly to along with another, previously received example that is cached in The lowest -value,, corresponding to some adjacent pair of intervals and, is drawn from If it exceeds the merging threshold, Phase 2 is complete Otherwise, the intervals and are merged to obtain a new interval, and the -values on both sides of are updated into MAT-72006 AADS, Fall 2015 27-Aug-15 26 13
16. else if phase = 2 then ------------------- 17. DEQUEUE); TREEINSERT ); 18., EXTRACTMIN); 19. if nil and then 20. Remove from P the D-values of I and J; 21. LIST-DELETE ); LISTDELETE ); 22. MERGE ); 23. Compute the -values of with its neighbors in ; 24. Update the -values into priority queue ; 25. else 26. phase 3; fi MAT-72006 AADS, Fall 2015 27-Aug-15 27 27. else -------------Phase = 3 ------------ 28. DEQUEUE); TREE-INSERT ); 29. LIST-DELETE,HEAD ); 30. LIST-INSERT ); 31. if EMPTY) then 32. phase 1; TREE-MINIM UM); 33. Make an empty priority queue, size 1; 34. fi 35. fi 36. od MAT-72006 AADS, Fall 2015 27-Aug-15 28 14
Time Complexity of OCM The insertion of the received example to clearly takes constant time, as well as the insertion of an interval from to Also computing the -value can be considered a constant-time operation Inserting the obtained value to requires time (lg) Finding a successor in also takes time (lg) Thus, the total time consumption of one iteration of Phase 1 is (lg) MAT-72006 AADS, Fall 2015 27-Aug-15 29 Insertion of the new example and one from the queue to the BST both take time (lg) Extracting the lowest -value from the priority queue is also an lg time operation As contains pointers to the items holding intervals and in, removing them can be implemented in constant time Merging of and to obtain can be considered a constant-time operation as well as computing the new values for and its neighbors in Updating each of the new -values into takes lg time Thus, the total time consumption of one iteration of Phase 2 is also lg MAT-72006 AADS, Fall 2015 27-Aug-15 30 15
The sorting problem Input: A sequence of numbers,, Output: A permutation (reordering),, of the input sequence such that The numbers that we wish to sort are also known as keys MAT-72006 AADS, Fall 2015 27-Aug-15 31 INSERTION-SORT) 1. for 2to 2. 3. // Insert ] into the sorted sequence [1.. 1] 4. 5. while >0and ] > 6. + 1] ] 7. 1 8. [ + 1] MAT-72006 AADS, Fall 2015 27-Aug-15 32 16
5 2 4 6 1 3 2 5 4 6 1 3 2 4 5 6 1 3 2 4 5 6 1 3 1 2 4 5 6 3 1 2 3 4 5 6 MAT-72006 AADS, Fall 2015 27-Aug-15 33 Correctness of the Algorithm The following loop invariant helps us understand why the algorithm is correct: At the start of each iteration of the for loop of lines 1 8, the subarray [1.. 1] consists of the elements originally in [1.. 1] but in sorted order MAT-72006 AADS, Fall 2015 27-Aug-15 34 17
Initialization The loop invariant holds before the first loop iteration, when =2: The subarray, therefore, consists of just the single element [1] It is in fact the original element in [1] This subarray is trivially sorted Therefore, the loop invariant holds prior to the first iteration of the loop MAT-72006 AADS, Fall 2015 27-Aug-15 35 Maintenance Each iteration maintains the loop invariant: The body of the for loop works by moving 1], 2], 3], by one position to the right until the proper position for is found (lines 4 7) At this point the value of ] is inserted (line 8) The subarray [1.. ] then consists of the elements originally in [1..], but in sorted order MAT-72006 AADS, Fall 2015 27-Aug-15 36 18
Termination The condition causing the for loop to terminate is that Because each loop iteration increases by 1, we must have +1at that time Substituting +1for j in the wording of loop invariant, we have that the subarray [1.. ] consists of the elements originally in [1..], but in sorted order [1.. ]is the entire array MAT-72006 AADS, Fall 2015 27-Aug-15 37 Analysis of insertion sort The time taken by the INSERTION-SORT depends on the input: sorting a thousand numbers takes longer than sorting three numbers Moreover, the procedure can take different amounts of time to sort two input sequences of the same size depending on how nearly sorted they already are MAT-72006 AADS, Fall 2015 27-Aug-15 38 19
Input size The time taken by an algorithm grows with the size of the input Traditional to describe the running time of a program as a function of the size of its input For many problems, such as sorting most natural measure for input size is the number of items in the input i.e., the array size MAT-72006 AADS, Fall 2015 27-Aug-15 39 For, e.g., multiplying two integers, the best measure is the total number of bits needed to represent the input in binary notation Sometimes, more appropriate to describe the size with two numbers rather than one E.g., if the input to an algorithm is a graph, the input size can be described by the numbers of vertices and edges in it MAT-72006 AADS, Fall 2015 27-Aug-15 40 20
Running time Running time of an algorithm on an input: The number of primitive operations ( steps ) executed Step as machine-independent as possible For the moment: Constant amount of time to execute each line of pseudocode We assume that each execution of the th line takes time, where is a constant MAT-72006 AADS, Fall 2015 27-Aug-15 41 INSERTION-SORT) cost times 1 for 2to 1 2 ] 2 1 3 // Insert into the sorted sequence [1..] 0 1 4 4 1 5 while >0and ] > 5 6 + 1] ] 6 1) 7 7 1) 8 + 1] 8 1 MAT-72006 AADS, Fall 2015 27-Aug-15 42 21
denotes the number of times the while loop test in line 5 is executed for that value of When a for or while loop exits in the usual way, the test is executed one time more than the loop body Comments are not executable statements, so they take no time Running time of the algorithm is the sum of those for each statement executed MAT-72006 AADS, Fall 2015 27-Aug-15 43 To compute ), the running time of INSERTION-SORT on an input of values, we sum the products of the cost and times columns, obtaining 1 + 1 + 1) 1 + 1) MAT-72006 AADS, Fall 2015 27-Aug-15 44 22
Best case The best case occurs if the array is already sorted For each = 2,3,,, we then nd that ] < in line 5 when has its initial value of 1 Thus =1for = 2,3,,, and the best-case running time is 1 + 1 + 1 + 1 ) MAT-72006 AADS, Fall 2015 27-Aug-15 45 Worst case We can express this as for constants and that depend on the statement costs It is a linear function of The worst case results when the array is in reverse sorted order in decreasing order We must compare each element ] with each element in the entire sorted subarray [1.. 1], and so for = 2,3,, MAT-72006 AADS, Fall 2015 27-Aug-15 46 23
Note that and = 1 = + 1) 2 1 1) 2 by the summation of an arithmetic series + 1) = 2 MAT-72006 AADS, Fall 2015 27-Aug-15 47 The worst-case running time of INSERTION- SORT is 1 + 1 + 1) 1) 1 + 2 2 1) 2 1) = 2 + 2 + 2 + ( ++ ) MAT-72006 AADS, Fall 2015 27-Aug-15 48 24
We can express this worst-case running time as 2 for constants,, and that depend on the statement costs It is a quadratic function of The rate of growth, or order of growth, of the running time really interests us We consider only the leading term of a formula ( 2 ); the lower-order terms are relatively insignicant for large values of MAT-72006 AADS, Fall 2015 27-Aug-15 49 We also ignore the leading term s coefcient, constant factors are less signicant than the rate of growth in determining computational efciency for large inputs For insertion sort, we are left with the factor of 2 from the leading term We write that insertion sort has a worst-case running time of 2 ) ( theta of -squared ) MAT-72006 AADS, Fall 2015 27-Aug-15 50 25
2.3 Designing algorithms Insertion sort is an incremental approach: having sorted [1.. 1], we insert ] into its proper place, yielding sorted subarray [1.. ] Let us examine an alternative design approach, known as divide-and-conquer We design a sorting algorithm whose worst-case running time is much lower The running times of divide-and-conquer algorithms are often easily determined MAT-72006 AADS, Fall 2015 27-Aug-15 51 The divide-and-conquer approach Many useful algorithms are recursive: to solve a problem, they call themselves to deal with closely related subproblems These algorithms typically follow a divideand-conquer approach: Break the problem into subproblems that resemble the original problem but are smaller, Solve the subproblems recursively, Combine these solutions to create a solution to the original problem MAT-72006 AADS, Fall 2015 27-Aug-15 52 26
The paradigm involves three steps at each level of the recursion: 1. Divide the problem into a number of subproblems that are smaller instances of the same problem 2. Conquer the subproblems by solving them recursively If the sizes are small enough, just solve the subproblems in a straightforward manner 3. Combine the solutions to the subproblems into the solution for the original problem MAT-72006 AADS, Fall 2015 27-Aug-15 53 The merge sort algorithm Divide: Divide the -element sequence into two subsequences of /2 elements each Conquer: Sort the two subsequences recursively using merge sort Combine: Merge the two sorted subsequences to produce the sorted answer Recursion bottoms out when the sequence to be sorted has length 1: a sequence of length 1 is already in sorted order MAT-72006 AADS, Fall 2015 27-Aug-15 54 27
The key operation is the merging of two sorted sequences in the combine step We call auxiliary procedure MERGE), where is an array and,, and are indices such that The procedure assumes that the subarrays.. and + 1.. ] are in sorted order and merges them to form a single sorted subarray that replaces the current subarray..] MAT-72006 AADS, Fall 2015 27-Aug-15 55 MERGE) 1. 1 +1 2. 2 3. Let [1.. 1 + 1] and [1.. 2 + 1]be new arrays 4. for 1to 1 5. 1] 6. for 1to 2 7. 8. 1 +1] 9. 2 +1] 10. 1 11. 1 12. for to 13. if 14. ] 15. +1 16. else ] 17. +1 MAT-72006 AADS, Fall 2015 27-Aug-15 56 28
Line 1 computes the length 1 of the subarray..];similarly for 2 and +1..]on line 2 Line 3 creates arrays (left) and (right), of lengths 1 +1and 2 +1, respectively the extra position will hold the sentinel The for loop of lines 4 5 copies.. into [1.. ]; Lines 6 7 copy + 1.. into [1.. 2 ] Lines 8 9 put the sentinels at the ends of and MAT-72006 AADS, Fall 2015 27-Aug-15 57 Lines 10 17 perform the +1basic steps by maintaining the following loop invariant: At the start of each iteration of the for loop of lines 12 17,.. 1]contains the smallest elements of [1.. 1 + 1]and [1.. 2 + 1], in sorted order Moreover, ] and ] are the smallest elements of their arrays that have not been copied back into MAT-72006 AADS, Fall 2015 27-Aug-15 58 29
MERGE(, 9, 12, 16) 8 9 10 11 12 13 14 15 16 17 2 4 5 7 1 2 3 6 1 2 3 4 5 2 4 5 7 1 2 3 4 5 1 2 3 6 8 9 10 11 12 13 14 15 16 17 1 4 5 7 1 2 3 6 1 2 3 4 5 2 4 5 7 1 2 3 4 5 1 2 3 6 MAT-72006 AADS, Fall 2015 27-Aug-15 59 8 9 10 11 12 13 14 15 16 17 1 2 5 7 1 2 3 6 1 2 3 4 5 2 4 5 7 1 2 3 4 5 1 2 3 6 8 9 10 11 12 13 14 15 16 17 1 2 2 7 1 2 3 6 1 2 3 4 5 2 4 5 7 1 2 3 4 5 1 2 3 6 MAT-72006 AADS, Fall 2015 27-Aug-15 60 30
8 9 10 11 12 13 14 15 16 17 1 2 2 3 1 2 3 6 1 2 3 4 5 2 4 5 7 i 1 2 3 4 5 1 2 3 6 8 9 10 11 12 13 14 15 16 17 1 2 2 3 4 2 3 6 1 2 3 4 5 2 4 5 7 1 2 3 4 5 1 2 3 6 MAT-72006 AADS, Fall 2015 27-Aug-15 61 8 9 10 11 12 13 14 15 16 17 1 2 2 3 4 5 3 6 1 2 3 4 5 2 4 5 7 1 2 3 4 5 1 2 3 6 8 9 10 11 12 13 14 15 16 17 1 2 2 3 4 5 6 6 1 2 3 4 5 2 4 5 7 1 2 3 4 5 1 2 3 6 MAT-72006 AADS, Fall 2015 27-Aug-15 62 31
8 9 10 11 12 13 14 15 16 17 1 2 2 3 4 5 6 7 1 2 3 4 5 2 4 5 7 1 2 3 4 5 1 2 3 6 The needed +1iterations of the last for loop have been executed: [9.. 16] is sorted, and the two sentinels in and are the only two elements in these arrays that have not been copied into MAT-72006 AADS, Fall 2015 27-Aug-15 63 MERGE procedure runs in ) time, where +1 each of lines 1 3 and 8 11 takes constant time the for loops of lines 4 7 take 1 2 = ) time there are iterations of the for loop of lines 12 17, each of which takes constant time MAT-72006 AADS, Fall 2015 27-Aug-15 64 32
Merge sort The procedure MERGE-SORT sorts the elements in.. If, the subarray has at most one element and is therefore already sorted Otherwise, the divide step computes an index that partitions.. into two subarrays:..], containing /2elements +1..],containing /2elements MAT-72006 AADS, Fall 2015 27-Aug-15 65 MERGE-SORT) 1. if 2. )/2 3. MERGE-SORT) 4. MERGE-SORT+1,) 5. MERGE) MAT-72006 AADS, Fall 2015 27-Aug-15 66 33
Analysis of merge sort Our analysis assumes that the original problem size is a power of 2 Each divide step then yields two subsequences of size exactly /2 We set up the recurrence for ), the worstcase running time of merge sort on numbers Merge sort on just one element takes constant time MAT-72006 AADS, Fall 2015 27-Aug-15 67 When we have >1elements, we break down the running time as follows: Divide: The step just computes the middle of the subarray, which takes constant time: ) = (1) Conquer: We recursively solve two subproblems, each of size /2, which contributes /2) to the running time Combine: the MERGE procedure on an element array takes time ), and so ) = ) MAT-72006 AADS, Fall 2015 27-Aug-15 68 34
When we add the ) and ), we are adding functions that are ) and (1) This sum is a linear function of Adding it to the /2) term from the conquer step gives the recurrence for ): (1) = 2) + ) if=1 if>1 MAT-72006 AADS, Fall 2015 27-Aug-15 69 To intuitively see that the solution to the recurrence is ) = lg),where lg stands for log 2, let us rewrite it as if=1 = 2) + if>1 where constant represents time required to solve problems of size 1 and that per array element of the divide and combine steps MAT-72006 AADS, Fall 2015 27-Aug-15 70 35
MAT-72006 AADS, Fall 2015 27-Aug-15 71 36