June 9, 2014
DP: Longest common subsequence

Biologists often need to find out how similar two DNA sequences are. DNA sequences are strings over the bases A, C, T and G. How to define similarity?
- one is a substring of the other
- the number of changes (mutations) needed to change one string into the other
- the longest common subsequence of two strings S_1 and S_2: a longest sequence S_3 appearing in each of S_1 and S_2 (in the same order, but not necessarily consecutively)

Definition. Z = z_1 z_2 ... z_k is a subsequence of S = s_1 s_2 ... s_n if there exists an increasing sequence of indices 1 ≤ i_1 < i_2 < ... < i_k ≤ n such that z_j = s_{i_j}.
Example. Z = GCCA is a subsequence of S = GGCACTGTAC.

Definition. Z is a common subsequence of X and Y if it is a subsequence of both X and Y. A longest such Z is called a longest common subsequence (LCS).

Example. Consider X = GGCACTGTAC and Y = CATGTCACGG. Then ATAC and GCAG are common subsequences of X and Y. The longest common subsequence is CATGTAC.
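The subsequence test itself is simple: a single left-to-right scan suffices. A sketch (the function name is our own):

```python
def is_subsequence(z, s):
    """Return True if z is a subsequence of s (same order, not necessarily consecutive)."""
    it = iter(s)
    # 'ch in it' scans the iterator forward until ch is found,
    # so later characters of z are matched only after earlier ones
    return all(ch in it for ch in z)

print(is_subsequence("GCCA", "GGCACTGTAC"))    # True
print(is_subsequence("CATGTAC", "CATGTCACGG")) # True: the LCS appears in both strings
```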
Brute-force approach: list all subsequences of X and test each one for being a subsequence of Y. If X has length m, there are 2^m subsequences of X, so this takes exponential time. We should apply the dynamic programming approach instead.
Optimal substructure of LCS

Claim. Let Z = z_1 ... z_k be an LCS of X = x_1 ... x_m and Y = y_1 ... y_n. Then
1. if x_m = y_n, then z_k = x_m = y_n and Z_{1,k-1} is an LCS of X_{1,m-1} and Y_{1,n-1};
2. if x_m ≠ y_n and z_k ≠ x_m, then Z is an LCS of X_{1,m-1} and Y;
3. if x_m ≠ y_n and z_k ≠ y_n, then Z is an LCS of X and Y_{1,n-1}.
Proof.
1. If z_k ≠ x_m = y_n, then Z x_m would be a common subsequence of X and Y longer than Z, a contradiction. Clearly, Z_{1,k-1} is a common subsequence of X_{1,m-1} and Y_{1,n-1}. If it were not a longest one, let W be an LCS of X_{1,m-1} and Y_{1,n-1}; then W z_k would be a common subsequence of X and Y longer than Z, again a contradiction ("cut-and-paste").
2. Clearly, since z_k ≠ x_m, Z is a common subsequence of X_{1,m-1} and Y; if it were not a longest one, use the cut-and-paste technique again.
3. Similar to case 2.

Hence, an LCS of two sequences contains within it an LCS of prefixes of these two sequences: the optimal substructure property.
A recursive solution

To find an LCS of X = x_1 ... x_m and Y = y_1 ... y_n:
- if x_m = y_n, then find an LCS of X_{1,m-1} and Y_{1,n-1} and append x_m = y_n to it
- if x_m ≠ y_n, then find an LCS of X and Y_{1,n-1} and an LCS of X_{1,m-1} and Y, and take the longer of these two

Let c[i, j] be the length of an LCS of X_{1,i} and Y_{1,j}. Recursive formula:

c[i, j] = 0                           if i = 0 or j = 0
c[i, j] = c[i-1, j-1] + 1             if i, j > 0 and x_i = y_j
c[i, j] = max(c[i, j-1], c[i-1, j])   if i, j > 0 and x_i ≠ y_j
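The recursive formula maps directly onto a top-down function; with memoization the exponential blow-up disappears. A sketch (the function names are ours):

```python
from functools import lru_cache

def lcs_length(X, Y):
    @lru_cache(maxsize=None)          # memoize: each (i, j) is computed once
    def c(i, j):
        if i == 0 or j == 0:          # an empty prefix: LCS has length 0
            return 0
        if X[i - 1] == Y[j - 1]:      # x_i = y_j: extend the LCS of the shorter prefixes
            return c(i - 1, j - 1) + 1
        return max(c(i, j - 1), c(i - 1, j))
    return c(len(X), len(Y))

print(lcs_length("GGCACTGTAC", "CATGTCACGG"))  # 7 (an LCS is CATGTAC)
```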
Computing

A recursive algorithm based on the recursive formula would be exponential; however, there are only (m + 1)(n + 1) distinct subproblems ("overlapping-subproblems property"). The entries of table c[0..m, 0..n] are filled in row-major order: the first row from left to right, then the second row from left to right, etc. Table b[1..m, 1..n] contains the information needed to construct an optimal solution (it records from which neighbor the value c[i, j] was obtained: c[i, j] = c[i, j-1], c[i, j] = c[i-1, j], or c[i, j] = c[i-1, j-1] + 1).
LCS-Length(X, Y)
1.  m := length[X]
2.  n := length[Y]
3.  for i := 1 to m do c[i, 0] := 0
4.  for j := 1 to n do c[0, j] := 0
5.  for i := 1 to m
6.    for j := 1 to n
7.      if x_i = y_j
8.        c[i, j] := c[i-1, j-1] + 1
9.        b[i, j] := "↖"
10.     else if c[i-1, j] ≥ c[i, j-1]
11.       c[i, j] := c[i-1, j]
12.       b[i, j] := "↑"
13.     else c[i, j] := c[i, j-1]
14.       b[i, j] := "←"
15. return c and b

Time complexity: O(mn)
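A bottom-up Python transcription of LCS-Length might look as follows (a sketch; the arrows of the b table are stored as the strings "↖", "↑", "←"):

```python
def lcs_tables(X, Y):
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]    # row 0 and column 0 stay 0
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "↖"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "↑"
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "←"
    return c, b

c, b = lcs_tables("CATGTCACGG", "GGCACTGTAC")
print(c[10][10])  # 7
```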
The c table for the example (rows i: CATGTCACGG, columns j: GGCACTGTAC):

      j    0   1   2   3   4   5   6   7   8   9  10
  i            G   G   C   A   C   T   G   T   A   C
   0       0   0   0   0   0   0   0   0   0   0   0
   1  C    0   0   0   1   1   1   1   1   1   1   1
   2  A    0   0   0   1   2   2   2   2   2   2   2
   3  T    0   0   0   1   2   2   3   3   3   3   3
   4  G    0   1   1   1   2   2   3   4   4   4   4
   5  T    0   1   1   1   2   2   3   4   5   5   5
   6  C    0   1   1   2   2   3   3   4   5   5   6
   7  A    0   1   1   2   3   3   3   4   5   6   6
   8  C    0   1   1   2   3   4   4   4   5   6   7
   9  G    0   1   2   2   3   4   4   5   5   6   7
  10  G    0   1   2   2   3   4   4   5   5   6   7
The c table for X = ACGCTAC (rows x_i) and Y = CTGACA (columns y_j):

      j    0   1   2   3   4   5   6
  i            C   T   G   A   C   A
   0       0   0   0   0   0   0   0
   1  A    0   0   0   0   1   1   1
   2  C    0   1   1   1   1   2   2
   3  G    0   1   1   2   2   2   2
   4  C    0   1   1   2   2   3   3
   5  T    0   1   2   2   2   3   3
   6  A    0   1   2   2   3   3   4
   7  C    0   1   2   2   3   4   4
PRINT-LCS(b, X, Y, i, j)
1. if i = 0 or j = 0 then return
2. if b[i, j] = "↖" then PRINT-LCS(b, X, Y, i-1, j-1); print x_i
3. else if b[i, j] = "↑" then PRINT-LCS(b, X, Y, i-1, j)
4. else PRINT-LCS(b, X, Y, i, j-1)

The initial call is PRINT-LCS(b, X, Y, m, n).
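Combining LCS-Length and PRINT-LCS gives a self-contained sketch that returns one LCS (function names are ours; it collects characters into a list instead of printing them):

```python
def lcs(X, Y):
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "↖"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], "↑"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "←"

    out = []
    def walk(i, j):               # PRINT-LCS, collecting characters into `out`
        if i == 0 or j == 0:
            return
        if b[i][j] == "↖":
            walk(i - 1, j - 1)
            out.append(X[i - 1])
        elif b[i][j] == "↑":
            walk(i - 1, j)
        else:
            walk(i, j - 1)
    walk(m, n)
    return "".join(out)

print(lcs("ACGCTAC", "CTGACA"))  # prints one LCS of length 4
```

Which LCS is returned depends on the tie-breaking in step 10 of LCS-Length; any string it returns has the optimal length.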
Exercise: Matrix Multiplications

Given: a chain of matrices (A_1, A_2, ..., A_n), with A_i having dimension p_{i-1} × p_i.
Goal: compute the product A_1 A_2 ⋯ A_n as fast as possible.

Clearly, the time to multiply two matrices depends on their dimensions. Does the order of multiplication (= parenthesization) matter?

Example: n = 4. Possible orders:
(A_1(A_2(A_3A_4)))
(A_1((A_2A_3)A_4))
((A_1A_2)(A_3A_4))
((A_1(A_2A_3))A_4)
(((A_1A_2)A_3)A_4)
Suppose A_1 is 10 × 100, A_2 is 100 × 5, A_3 is 5 × 50, and A_4 is 50 × 10. Assume that multiplying a (p × q)-matrix by a (q × r)-matrix takes pqr steps (a straightforward algorithm).

Order 2: (A_1((A_2A_3)A_4)) costs 100·5·50 + 100·50·10 + 10·100·10 = 85,000
Order 5: (((A_1A_2)A_3)A_4) costs 10·100·5 + 10·5·50 + 10·50·10 = 12,500

It seems it might be a good idea to find a good order.
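The two costs can be checked by direct arithmetic; a short sketch (the helper name is ours):

```python
def mult_cost(p, q, r):
    """Scalar multiplications for a (p x q) times (q x r) product: pqr."""
    return p * q * r

# A1: 10x100, A2: 100x5, A3: 5x50, A4: 50x10
# Order 2: (A1((A2A3)A4)) -- A2A3 first, then times A4, then A1 times the rest
order2 = mult_cost(100, 5, 50) + mult_cost(100, 50, 10) + mult_cost(10, 100, 10)
# Order 5: (((A1A2)A3)A4) -- left to right
order5 = mult_cost(10, 100, 5) + mult_cost(10, 5, 50) + mult_cost(10, 50, 10)
print(order2, order5)  # 85000 12500
```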
How many orders are there? Can we just check all of them? (We look only at fully parenthesized matrix products.)

Let P(n) be the number of orders of a sequence of n matrices. Clearly, P(1) = 1 (only one matrix).

If n ≥ 2, a matrix product is the product of two matrix sub-products. The split may occur between the k-th and (k+1)-st position, for any k = 1, 2, ..., n-1 ("top-level multiplication").

Thus

P(n) = 1                               if n = 1
P(n) = Σ_{k=1}^{n-1} P(k) · P(n-k)     if n ≥ 2

Unfortunately, P(n) = Ω(4^n / n^{3/2}), and thus (easier to see) P(n) = Ω(2^n). So the brute-force approach (checking all parenthesizations) is no good.
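The recurrence for P(n) (these are the Catalan numbers, shifted by one index) can be tabulated to watch the growth; a sketch:

```python
def num_orders(n):
    """P(n): number of full parenthesizations of a chain of n matrices."""
    P = [0] * (n + 1)
    P[1] = 1
    for m in range(2, n + 1):
        # top-level split after position k, for k = 1 .. m-1
        P[m] = sum(P[k] * P[m - k] for k in range(1, m))
    return P[n]

print([num_orders(n) for n in range(1, 9)])  # [1, 1, 2, 5, 14, 42, 132, 429]
```

Note that P(4) = 5 matches the five orders listed above for n = 4.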
We will use the dynamic programming approach to solve this problem optimally. The four basic steps when designing a dynamic programming algorithm:
1. Characterize the structure of an optimal solution
2. Recursively define the value of an optimal solution
3. Compute the value of an optimal solution in a bottom-up fashion
4. Construct an optimal solution from computed information
1. Characterizing structure

Let A_{i,j} = A_i ⋯ A_j for i ≤ j. If i < j, then any parenthesization of A_{i,j} must split the product at some k, i ≤ k < j, i.e., compute A_{i,k} and A_{k+1,j}, and then A_{i,k} · A_{k+1,j}. Hence, for some k, the cost of computing A_{i,j} is the cost of computing A_{i,k}, plus the cost of computing A_{k+1,j}, plus the cost of multiplying A_{i,k} and A_{k+1,j}.
Optimal substructure: suppose that an optimal parenthesization of A_{i,j} splits the product between A_k and A_{k+1}. Then the parenthesizations of A_{i,k} and A_{k+1,j} within this optimal parenthesization must also be optimal (otherwise, substitute the optimal parenthesization of A_{i,k} (resp. A_{k+1,j}) into the current parenthesization of A_{i,j} and obtain a better solution, a contradiction).

Use optimal substructure to construct an optimal solution:
1. split into two subproblems (choosing an optimal split),
2. find optimal solutions to the subproblems,
3. combine the optimal subproblem solutions.
A recursive solution

Let m[i, j] denote the minimum number of multiplications needed to compute A_{i,j} = A_i A_{i+1} ⋯ A_j (the full problem is m[1, n]). Recursive definition of m[i, j]:
- if i = j, then m[i, j] = m[i, i] = 0 (no multiplication needed)
- if i < j, assume the optimal split is at k, i ≤ k < j. Since each matrix A_i is p_{i-1} × p_i, A_{i,k} is p_{i-1} × p_k and A_{k+1,j} is p_k × p_j, so
  m[i, j] = m[i, k] + m[k+1, j] + p_{i-1} p_k p_j

We do not know the optimal value of k. There are j - i possibilities, k = i, i+1, ..., j-1, hence

m[i, j] = 0                                                      if i = j
m[i, j] = min_{i ≤ k < j} { m[i, k] + m[k+1, j] + p_{i-1} p_k p_j }   if i < j

We also keep track of the optimal splits: s[i, j] = k if m[i, j] = m[i, k] + m[k+1, j] + p_{i-1} p_k p_j (s[i, j] is the value of k at which we split the product A_{i,j} to obtain an optimal parenthesization).
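The recurrence for m[i, j] can be sketched top-down with memoization (the function names are ours; p is the dimension list p_0, ..., p_n):

```python
from functools import lru_cache

def matrix_chain_cost(p):
    n = len(p) - 1  # number of matrices; A_i is p[i-1] x p[i]

    @lru_cache(maxsize=None)
    def m(i, j):
        if i == j:
            return 0  # a single matrix needs no multiplication
        # try every split point k and keep the cheapest
        return min(m(i, k) + m(k + 1, j) + p[i - 1] * p[k] * p[j]
                   for k in range(i, j))
    return m(1, n)

print(matrix_chain_cost([10, 100, 5, 50, 10]))  # 8000
```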
Computing the optimal costs

We want to compute m[1, n], the minimum cost of multiplying A_1 A_2 ⋯ A_n. Computed recursively, it would take Ω(2^n) steps: the same subproblems are computed over and over again. However, if we compute in a bottom-up fashion, we can reduce the running time to a polynomial in n.

The recursive equation shows that the cost m[i, j] (a product of j - i + 1 matrices) depends only on smaller subproblems: for k = i, ..., j-1, A_{i,k} is a product of k - i + 1 < j - i + 1 matrices, and A_{k+1,j} is a product of j - k < j - i + 1 matrices. The algorithm should therefore fill the table m in order of increasing chain length.
Matrix-Chain-Order(p)
1.  n := length[p] - 1
2.  for i := 1 to n
3.    m[i, i] := 0
4.  for l := 2 to n
5.    for i := 1 to n - l + 1
6.      j := i + l - 1
7.      m[i, j] := ∞
8.      for k := i to j - 1
9.        q := m[i, k] + m[k+1, j] + p_{i-1} p_k p_j
10.       if q < m[i, j]
11.         m[i, j] := q
12.         s[i, j] := k
13. return m and s
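A Python transcription of Matrix-Chain-Order, together with a small helper (ours) that reads an optimal parenthesization back out of the s table, might look like:

```python
def matrix_chain_order(p):
    n = len(p) - 1
    # m[i][j]: min cost for A_i..A_j; s[i][j]: optimal split point k
    m = [[0] * (n + 1) for _ in range(n + 1)]
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(2, n + 1):            # l = chain length
        for i in range(1, n - l + 2):
            j = i + l - 1
            m[i][j] = float("inf")
            for k in range(i, j):
                q = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                if q < m[i][j]:
                    m[i][j], s[i][j] = q, k
    return m, s

def paren(s, i, j):
    """Rebuild the optimal parenthesization of A_i..A_j from the split table s."""
    if i == j:
        return f"A{i}"
    k = s[i][j]
    return f"({paren(s, i, k)}{paren(s, k + 1, j)})"

m, s = matrix_chain_order([10, 100, 5, 50, 10])
print(m[1][4], paren(s, 1, 4))  # 8000 ((A1A2)(A3A4))
```

For the example dimensions above, the optimal order is neither of the two orders computed earlier: splitting the chain in the middle costs only 8,000 multiplications.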