Advanced Algorithms Class Notes for Monday, November 10, 2014

Size: px

Start display at page:

Download "Advanced Algorithms Class Notes for Monday, November 10, 2014"

Poppy McGee
5 years ago
Views:

1 Advanced Algorithms Class Notes for Monday, November 10, 2014 Bernard Moret Divide-and-Conquer: Matrix Multiplication Divide-and-conquer is especially useful in computational geometry, but also in numerical algebra. One of the most celebrated and influential algorithms in algebra is the Fast Fourier Transform (FFT) algorithm: it is a typical use of divide-and-conquer. Here we look at another famous algorithm: Strassen s algorithm for fast matrix multiplication. The origin of this algorithm goes back to the days when multiplying two values (not matrices, just scalars) was much more expensive than adding them so that programmers and algorithm designers strove to avoid multiplications. Strassen s algorithm started with an attempt at reducing the number of scalar multiplications needed to compute the product of two 2 2 matrices: ( a11 a 12 a 21 a 22 ) ( b11 b 12 b 21 b 22 ) ( c11 c = 12 c 21 c 22 Computed in the standard manner, with each entry of the product computed separately, this product requires 2 scalar multiplications for each entry, or 8 in all. Strassen found a way to compute 7 subexpressions, each using only one multiplication, such that all 8 entries can then be computed from these subexpressions using only additions and multiplications. The subexpressions are not obvious: Strassen used the following: a 11 b 11 (a 12 a 21 + a 11 a 22 )b 22 a 12 b 21 a 22 (b 11 + b 22 b 12 b 21 ) (a 11 a 21 )(b 22 b 12 ) (a 21 + a 22 a 11 )(b 22 b 12 + b 11 ) (a 21 + a 22 )(b 12 b 11 ) Using these, one can compute the product of two 2 2 matrices with 7 multiplications and 15 additions and subtractions, instead of 8 multiplications and 4 additions and subtractions. Today, of course, multiplication is almost as fast as addition for standard number representation (single or double precision numbers) although it remains much slower than addition for unbounded-precision arithmetic. Thus Strassen s find would be little more than a footnote in the history of computing if he had not quickly realized that his algorithm could be used recursively in a divide-and-conquer scheme. As stated, his algorithm computes the product of 2 2 matrices by reducing the number of products of 1 1 matrices (scalars); by using this approach recursively, we can take 4 4 matrices, divide each into four 2 2 matrices, and compute the product of 4 4 matrices using 7 products (and 15 additions/subtractions) of 2 2 matrices. When reduced to multiplications and additions of scalars, this yields a total of 7 7=49 scalar multiplications and(7 15)+(15 4)= 165 additions and subtractions, whereas the standard method would need = 64 scalar ) 1

2 multiplications and = 48 additions. If we move to 8 8 matrices, we now get 7 7 7=343 scalar multiplications instead of 512, and (7 165)+(15 16) = 1395 additions and subtractions instead of = 448. Notice that the ratio of excess additions and subtractions decreases steadily: the standard method requires (n 1) n 2, or Θ(n 3 ) additions, whereas Strassen s algorithm uses f(n)=7 f(n/2)+15n 2 /4 additions and subtractions, which yields f(n)= Θ(n log 2 7 ). In other words, the penalty sustained on 2 2 matrices in terms of additions and subtractions is quickly overshadowed by the fact that we have fewer matrix multiplications to perform, thereby reducing the number of both multiplicative and additive scalar operations. In terms of scalar multiplications, the standard method uses exactly n 3 of them, where Strassen s uses f(n)=7 f(n/2) in all, which is again Θ(n log 2 7 ). Hence the standard method computes the product of two n n matrices in Θ(n 3 ) time, whereas, using divide-and-conquer and a clever way to handle matrices divided into four equal submatrices, Strassen s algorithm takes O(n log 2 7 ) time, a substantial improvement. Strassen s algorithm is by no means the fastest matrix multiplication algorithm in asymptotic terms, but it is the fastest practical algorithm. The Copper-Winograd algorithm, inspired from Strassen s algorithm, but far more complex, runs (in some of its variations) in O(n 2.38 ), but its coefficients are so large that the algorithm remains impractical today. We just add a couple of other examples of divide-and-conquer for numerical programming. We can compute Fibonacci numbers exactly in logarithmic time by considering products of Fibonacci numbers for arguments divided by 2 (obviously, using floating-point numbers, we can compute in constant time by using the irrational function that solves the recurrence, but we get numerical errors for large values). We can multiply two polynomials of degree n in Θ(n log 2 3 ) time (instead of Θ(n 2 ) by the obvious method) by devising a way to compute (a 1 n+a 0 ) (b 1 n+b 0 ) with only three multiplications instead of four this is a direct analog of Strassen s algorithm. (In this case, we can compute the product even faster using FFTs, since it takes Θ(n log n) to compute the forward FFT, linear time to add the two transformed functions in Fourier space, then again Θ(nlog n) to reverse the FFT. However, FFTs require non-integral operations, whereas the divide-and-conquer uses only integral operations.) Dynamic Programming Dynamic programming is the last main non-randomized algorithmic design method. It can be viewed simply as a lazy evaluation approach, or as a generalization of divide-andconquer. The first applies to simple computations, for instance, computing Fibonacci numbers by storing the value returned by a recursive call and making such calls only when the desired value is not already stored. The second is more to the point: divide-and-conquer, 2

3 as we have seen, can always divide in any manner without affecting the correctness or optimality of the answer returned the choice of how to divide affects only the running time, not the answer. Whenever different divisions could return different answers, we must turn to another algorithmic paradigm and in such cases dynamic programming usually works. Trivial DPs: lazy evaluation Consider our old friends the Fibonacci numbers. If we run the code int function Fib(int n) int answer if (n==0) or (n==1) answer = 1 else answer = Fib(n-1) + Fib(n-2) return answer then the running time of our procedure is exponential! That is because it will compute the function over and over again for the same argument, within different branches of the recursion tree. (In fact, the running time is Θ(Fib(n)).) If, however, we use a global array initialized to a recognizable marker value, then we can avoid recomputing values as follows: int array storefib[0..n] storefib[0] = 1 storefib[1] = 1 for i=2 to n storefib[i] = -1 int function Fib(int n) if (storefib[n] == -1) storefib[n] = Fib(n-1) + Fib(n-2) return storefib[n] This new version runs in linear time. Obviously, it is much simpler and more efficient to use a direct loop to fill in the array, and even better to rotate values among three variables and so reduce the storage requirements, but these are minor issues compared to the gulf between exponential and linear time that was bridged by the simple expedient of storing the solutions to subproblems and looking them up whenever possible. Note, however, that such an approach relies on the fundamental premise that the number of subinstances is a polynomial function of the size of the instance. (For Fibonacci, this is a linear function: n subinstances when solving the instance F(n).) This is an absolute requirement for using dynamic programming, as otherwise both storage and running time will explode. We shall now discover a second requirement: when we must compute an optimal solution according to some given objective function (as opposed to a simple value, such as F(n)), the objective function for the full instance must be computable from the values of the objective function for the subinstances. 3

4 Matrix chain products A classic problem in dynamic programming is the so-called matrix chain product. Suppose you are to multiply many matrices in a single product: B= A i i where each matrix is rectangular and the dimensions all match that is, if matrix A I is m n, then matrix A i+1 is necessarily n k for some k. Because the matrices are rectangular, we cannot use fancy algorithms such as Strassen s: to compute the product A i a i+1 just described, producing a m k matrix, will take O(mnk) time, using the basic method of computing each element of the product independently by taking the scalar product of the corresponding row of the first matrix and column of the second matrix. Because matrix multiplication is associative, however, the order in which the various matrices are multiplied does not affect the answer, yet may have a strong effect on the running time. Consider the product of four matrices X = A B C D We can parenthesize this product as ((AB)C)D and the cost is then the cost of the product AB, or = , followed by the cost of multiplying the result by C, for an additional cost of = , followed by the cost of multiplying the result by D, for an additional cost of = and thus a total cost of In contrast, we could have parenthesized the product as A((BC)D) for a total cost of just = , at just one-hundredth of the cost! Thus the order of association (the parenthesization) matters very much. Given a long chain of matrix products of that type, then, what is the optimal parenthesization (where the cost is just the sum of the cost of each successive multiplication and the cost of the multiplication is just the product of the three dimensions of the two matrices)? If you look at the outermost layer of parentheses, they divide the expression into two sub-expressions, each of which defines the product of a certain number of matrices; the cost of these two sub-expressions can be computed recursively, while combining the two is a simple matrix product. Unfortunately, we do not know where to place the outermost pair of parentheses in our example, placing them to define(abc)d gave us a much worse result than placing them to define A(BCD). This is a classic situation for dynamic programming: we shall simply try all possible placements of this first pair. The number of choices is fortunately limited by the fact that, while the product is associative, it is not commutative the ordering of the matrices in the product is fixed. The outermost pair of parentheses defines the last matrix multiplication to be run and that multiplication can be chosen from just n 1 possibilities for chain of n matrices (from a 1 (2 n) split to a (1 n 1) n split). The cost of the split (1 i) (i+1 n) is the cost of the optimal way to compute the product of the first i matrices, plus the cost of the optimal way to compute the product of the last n i matrices, plus the cost of multiplying together the two resulting matrices, and we seek the choice of i that will minimize this expression. Now 4

5 this is just the first level; in order to compute the optimal way to compute the product of the first i matrices, we will seek a j that gives us an optimal decomposition into a product of the first j matrices, a product of the remaining i j matrices, and the matrix product of the two results. Thus, in general, we will need to find (and store) the optimal way to compute the product of the chain of matrices starting with matrix A i and ending with matrix A i+k, for all valid indices 1 i i+k n. There are Θ(n 2 ) such pairs and computing the optimal value for any of them involves finding the minimum value among k choices, hence the algorithm will require Θ(n 2 ) space and Θ(n 3 ) time. DP in general Obviously, a dynamic program is much more computationally demanding than a divideand-conquer algorithm: the divide-and-conquer algorithm does not test all possible ways of dividing a subproblem, but chooses one a priori (such as choosing the middle one, for instance), whereas the dynamic program must test every choice, in what amounts to a recursive setup (to compute the value of a choice, we must find out the optimal way to choose roots for its subtrees, and so forth all the way down). That the dynamic program is even able to run in polynomial time is due to a combination of two crucial factors. We have already remarked on the first: the number of subproblems the size of the array to be filled in is polynomial. The second is the computation of the value of an optimal solution to a subproblem: it requires only knowledge of the optimal solutions to smaller subproblems and can be computed in polynomial time using that knowledge. Another way to put it is that we can set up a recurrence relation to describe the optimal cost of a subproblem, a recurrence that depends only on entries for smaller subproblems; it is sometimes known in the literature as the principle of optimality. The principle of optimality by itself enables us to devise a dynamic program for an optimization problem, but in order for that algorithm to be useful, we also need to know that the number of subproblems is polynomially bounded. For instance, it is quite easy to devise a dynamic program for the Travelling Salesperson problem, but that DP will require exponential storage and take exponential time, because it will have an exponential number of subproblems. Sequence alignment Sequence alignment and its many variants have become the computational mainstay of research in biology, in part because of the advent of inexpensive, high-throughput sequencing machines: life sciences researchers today sequence anything at hand, from indiscriminate samplings of environments (gut flora, seawater, dirt, etc.) to understand microbial communities to targeted tissues to understand metabolism, aging, disease, etc. Much of what is done with sequences from a computational point of view predates the advent of DNA sequencing: computer scientists had already investigated many of the same problems in the context of strings, editors, spellcheckers, etc. The most basic purpose of sequence alignment is the computation of a similarity (or distance) measure between pairs of sequences. One such measure, the Levenshtein dis- 5

6 tance, was published (in Russian) in 1965 by Russian scientist Levenshtein, and corresponds pretty much exactly to today s sequence alignment. The Levenshtein distance is an example of an edit distance: given two structures from some family and given a collection of allowable operations on these structures, find the shortest (or least-cost) series of operations that will transform one of the given structures into the other. The operations are editing operations, their name directly suggesting their provenance from string operations. Levenshtein defined his metric in terms of single-character editing: the allowable operations are substitutions (or one character for another) and so-called indels, an abbreviation for single-character insertions and deletions. (However, his paper also discusses reversals as another operation.) The alignment version of this edit distance definition is simply the layout, in an indexed array, of the two strings, establishing common indexing that aligns the two strings, allowing gaps (empty array entries) in one string where an indel is indicated, and putting into 1-1 correspondence identical characters as well as characters that are the result of a substitution (biologists would call that a mutation ). For instance, consider the two strings ACCATT and ACATA; the edit distance is 2: we need to delete/insert a C, and substitute an A for a T. In terms of alignment, we would create the following pairwise alignment: A C C A T T A C A T A where the dash was used to indicate a gap corresponding to an indel, where the last position shows the substitution between A and T, and where all other positions are exactly matched. Many such edit distances have been proposed in Computer Science. The Damerau- Levenshtein distance, published in 1964 in the US by Damerau, adds one more operation, an exchange of adjacent characters a common problem in typing and hence a reasonable edit operation. The older Hamming distance (introduced in 1950 by Hamming) removes the indels, allowing only substitutions. All of these distances can be fine-tuned to the application by giving different costs to the specific substitutions, indels, and exchanges; for instance, typing substitutions depend directly on the language and the arrangement of characters on the keyboard. Naturally, computer scientists developped algorithms to compute these various edit distances, with or without weights on the operations. For two strings of length m and n, the Levenshtein and Damerau-Levenshtein distances can be computed in O(mn) time the full algorithm is generally credited to Lowrance and Wagner (1975); the Hamming distance, of course, is trivially computed in linear time, since it assumes that the two strings have the same length and thus does not need to align them, only to compare the character pairs at each index. In computational biology, these algorithms were re-invented and, in two slightly different variations, go by the name of Needleman-Wunsch (1970), which is exactly the Levenshtein problem, known in biology as global pairwise sequence alignment, and Smith-Waterman (1981), which solves a variant of the previous known as local pairwise sequence alignment. 6

Framework for Design of Dynamic Programming Algorithms

CSE 441T/541T Advanced Algorithms September 22, 2010 Framework for Design of Dynamic Programming Algorithms Dynamic programming algorithms for combinatorial optimization generalize the strategy we studied