15.4 Longest common subsequence

Phillip Miller
6 years ago

Biological applications often need to compare the DNA of two (or more) different organisms. A strand of DNA consists of a string of molecules called bases, where the possible bases are adenine, guanine, cytosine, and thymine. We express a strand of DNA as a string over the alphabet {A, C, G, T}. For example, the DNA of two organisms may be

$S_1$ = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
$S_2$ = GTCGTTCGGAATGCCGTTGCTCTGTAAA

By comparing two strands of DNA we determine how similar they are, as some measure of how closely related the two organisms are. We can define similarity in many different ways. For example, we can say that two DNA strands are similar if one is a substring of the other. Neither $S_1$ nor $S_2$ is a substring of the other. Alternatively, we could say that two strands are similar if the number of changes needed to turn one into the other is small.

Yet another way to measure the similarity of $S_1$ and $S_2$ is by finding a third strand $S_3$ in which the bases in $S_3$ appear in each of $S_1$ and $S_2$; these bases must appear in the same order, but not necessarily consecutively. The longer the strand $S_3$ we can find, the more similar $S_1$ and $S_2$ are. In our example, the longest strand $S_3$ is

$S_1$ = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
$S_2$ = GTCGTTCGGAATGCCGTTGCTCTGTAAA
$S_3$ = GTCGTCGGAAGCCGGCCGAA

We formalize this notion of similarity as the longest-common-subsequence problem. A subsequence is just the given sequence with zero or more elements left out. Formally, given a sequence $X = \langle x_1, x_2, \ldots, x_m \rangle$, another sequence $Z = \langle z_1, z_2, \ldots, z_k \rangle$ is a subsequence of $X$ if there exists a strictly increasing sequence $\langle i_1, i_2, \ldots, i_k \rangle$ of indices of $X$ such that for all $j = 1, 2, \ldots, k$, we have $x_{i_j} = z_j$. For example, $Z = \langle B, C, D, B \rangle$ is a subsequence of $X = \langle A, B, C, B, D, A, B \rangle$ with corresponding index sequence $\langle 2, 3, 5, 7 \rangle$.
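The definition of a subsequence can be checked directly with a greedy scan. The following is a minimal sketch, not part of the original slides; the function name `subsequence_indices` is hypothetical, and strings stand in for sequences.

```python
def subsequence_indices(z, x):
    """Return 1-based indices i_1 < ... < i_k with x[i_j] == z[j], or None.

    Greedy scan: for each symbol of z, take its earliest remaining
    occurrence in x. If z is a subsequence of x, this always succeeds.
    """
    indices = []
    pos = 0  # next 0-based position in x to examine
    for symbol in z:
        while pos < len(x) and x[pos] != symbol:
            pos += 1
        if pos == len(x):
            return None  # ran out of x: z is not a subsequence
        indices.append(pos + 1)  # report 1-based index, as in the text
        pos += 1
    return indices


print(subsequence_indices("BCDB", "ABCBDAB"))  # [2, 3, 5, 7]
```

The greedy choice reports the earliest valid index sequence; other index sequences may exist for the same subsequence.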

A sequence $Z$ is a common subsequence of $X$ and $Y$ if $Z$ is a subsequence of both $X$ and $Y$. For example, if $X = \langle A, B, C, B, D, A, B \rangle$ and $Y = \langle B, D, C, A, B, A \rangle$, the sequence $\langle B, C, A \rangle$ is a common subsequence of both $X$ and $Y$. It is not a longest common subsequence (LCS) of $X$ and $Y$, however: the sequence $\langle B, C, B, A \rangle$ is also common to both $X$ and $Y$ and has length 4. This sequence is an LCS of $X$ and $Y$, as is $\langle B, D, A, B \rangle$; $X$ and $Y$ have no common subsequence of length 5 or greater.

In the longest-common-subsequence problem, we are given $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$ and wish to find a maximum-length common subsequence of $X$ and $Y$.

Step 1: Characterizing a longest common subsequence

In a brute-force approach, we would enumerate all subsequences of $X$ and check each of them to see whether it is also a subsequence of $Y$, keeping track of the longest subsequence we find. Each subsequence of $X$ corresponds to a subset of the indices $\{1, 2, \ldots, m\}$ of $X$. Because $X$ has $2^m$ subsequences, this approach requires exponential time, making it impractical for long sequences.
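To make the brute-force idea concrete, here is a small sketch that enumerates index subsets of the first sequence, longest first. The function names are hypothetical and the code is only meant for very short inputs, since it may examine up to $2^m$ candidates.

```python
from itertools import combinations


def is_subsequence(z, y):
    """True if z can be obtained from y by deleting zero or more elements."""
    it = iter(y)
    # "symbol in it" advances the iterator, so order is respected.
    return all(symbol in it for symbol in z)


def lcs_brute_force(x, y):
    """Return some longest common subsequence of x and y by exhaustive search."""
    for k in range(len(x), -1, -1):  # try longer candidates first
        for idx in combinations(range(len(x)), k):  # each index subset of x
            candidate = "".join(x[i] for i in idx)
            if is_subsequence(candidate, y):
                return candidate
    return ""


print(len(lcs_brute_force("ABCBDAB", "BDCABA")))  # 4
```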

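The LCS problem has an optimal-substructure property, however, as the following theorem shows. The natural classes of subproblems correspond to pairs of prefixes of the two input sequences. Precisely, given a sequence $X = \langle x_1, x_2, \ldots, x_m \rangle$, we define the $i$th prefix of $X$, for $i = 0, 1, \ldots, m$, as $X_i = \langle x_1, x_2, \ldots, x_i \rangle$. For example, if $X = \langle A, B, C, B, D, A, B \rangle$, then $X_4 = \langle A, B, C, B \rangle$ and $X_0$ is the empty sequence.

Theorem 15.1 (Optimal substructure of LCS)
Let $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$ be sequences, and let $Z = \langle z_1, z_2, \ldots, z_k \rangle$ be any LCS of $X$ and $Y$.
1. If $x_m = y_n$, then $z_k = x_m = y_n$ and $Z_{k-1}$ is an LCS of $X_{m-1}$ and $Y_{n-1}$.
2. If $x_m \ne y_n$, then $z_k \ne x_m$ implies that $Z$ is an LCS of $X_{m-1}$ and $Y$.
3. If $x_m \ne y_n$, then $z_k \ne y_n$ implies that $Z$ is an LCS of $X$ and $Y_{n-1}$.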

Proof
(1) If $z_k \ne x_m$, then we could append $x_m = y_n$ to $Z$ to obtain a common subsequence of $X$ and $Y$ of length $k + 1$, contradicting the supposition that $Z$ is an LCS of $X$ and $Y$. Thus, we must have $z_k = x_m = y_n$. Now, the prefix $Z_{k-1}$ is a length-$(k-1)$ common subsequence of $X_{m-1}$ and $Y_{n-1}$. We wish to show that it is an LCS. Suppose for the purpose of contradiction that there exists a common subsequence $W$ of $X_{m-1}$ and $Y_{n-1}$ with length greater than $k - 1$. Then, appending $x_m = y_n$ to $W$ produces a common subsequence of $X$ and $Y$ whose length is greater than $k$, which is a contradiction.
(2) If $z_k \ne x_m$, then $Z$ is a common subsequence of $X_{m-1}$ and $Y$. If there were a common subsequence $W$ of $X_{m-1}$ and $Y$ with length greater than $k$, then $W$ would also be a common subsequence of $X_m$ and $Y$, contradicting the assumption that $Z$ is an LCS of $X$ and $Y$.
(3) The proof is symmetric to (2).

Theorem 15.1 tells us that an LCS of two sequences contains within it an LCS of prefixes of the two sequences. Thus, the LCS problem has an optimal-substructure property.

Step 2: A recursive solution

We examine either one or two subproblems when finding an LCS of $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$. If $x_m = y_n$, we find an LCS of $X_{m-1}$ and $Y_{n-1}$; appending $x_m = y_n$ yields an LCS of $X$ and $Y$. If $x_m \ne y_n$, then we (1) find an LCS of $X_{m-1}$ and $Y$ and (2) find an LCS of $X$ and $Y_{n-1}$. Whichever of these two LCSs is longer is an LCS of $X$ and $Y$. These cases exhaust all possibilities, and we know that one of the optimal subproblem solutions must appear within an LCS of $X$ and $Y$.

To find an LCS of $X$ and $Y$, we may need to find the LCSs of $X$ and $Y_{n-1}$ and of $X_{m-1}$ and $Y$. Each of these subproblems has the subsubproblem of finding an LCS of $X_{m-1}$ and $Y_{n-1}$. Many other subproblems share subsubproblems. As in matrix-chain multiplication, a recursive solution to the LCS problem involves a recurrence for the value of an optimal solution. Let us define $c[i, j]$ to be the length of an LCS of the sequences $X_i$ and $Y_j$. If either $i = 0$ or $j = 0$, one of the sequences has length 0, and so the LCS has length 0.
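The cost of recomputing shared subsubproblems can be observed by counting calls in a plain recursive implementation. This is an illustrative sketch (the name `naive_lcs_length` is hypothetical), not a procedure from the slides.

```python
def naive_lcs_length(x, y):
    """Return (LCS length, number of recursive calls) without memoization."""
    calls = 0

    def c(i, j):
        # c(i, j) is the LCS length of the prefixes x[:i] and y[:j].
        nonlocal calls
        calls += 1
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return c(i - 1, j - 1) + 1
        return max(c(i, j - 1), c(i - 1, j))

    return c(len(x), len(y)), calls


# Two length-4 strings with no symbol in common: only 5 * 5 = 25 distinct
# (i, j) pairs exist, yet the plain recursion makes 139 calls.
print(naive_lcs_length("AAAA", "BBBB"))  # (0, 139)
```

When no symbols match, every call branches twice, which is the worst case for this recursion.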

The optimal substructure of the LCS problem gives the recurrence

$$c[i, j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0, \\ c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j, \\ \max(c[i, j-1],\ c[i-1, j]) & \text{if } i, j > 0 \text{ and } x_i \ne y_j. \end{cases}$$

Observe that a condition in the problem restricts which subproblems we may consider. When $x_i = y_j$, we consider finding an LCS of $X_{i-1}$ and $Y_{j-1}$. Otherwise, we instead consider the two subproblems of finding an LCS of $X_i$ and $Y_{j-1}$ and of $X_{i-1}$ and $Y_j$. In the previous dynamic-programming algorithms for rod cutting and matrix-chain multiplication, we ruled out no subproblems due to conditions in the problem.

Step 3: Computing the length of an LCS

Since the LCS problem has only $\Theta(mn)$ distinct subproblems, we can use dynamic programming to compute the solutions bottom up. LCS-LENGTH stores the $c[i, j]$ values in a table $c[0..m, 0..n]$, and it computes the entries in row-major order. That is, the procedure fills in the first row of $c$ from left to right, then the second row, and so on. The procedure also maintains the table $b[1..m, 1..n]$. Intuitively, $b[i, j]$ points to the table entry corresponding to the optimal subproblem solution chosen when computing $c[i, j]$. The entry $c[m, n]$ contains the length of an LCS of $X$ and $Y$.
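The recurrence for $c[i, j]$ translates almost line for line into a memoized (top-down) function. The sketch below uses Python's `functools.lru_cache` for the memo table; the function name is hypothetical, and strings are indexed from 0, so $x_i$ becomes `x[i - 1]`.

```python
from functools import lru_cache


def lcs_length_memo(x, y):
    """Length of an LCS of x and y via the recurrence, computed top down."""

    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 or j == 0:          # one prefix is empty
            return 0
        if x[i - 1] == y[j - 1]:      # x_i == y_j
            return c(i - 1, j - 1) + 1
        return max(c(i, j - 1), c(i - 1, j))

    return c(len(x), len(y))


print(lcs_length_memo("ABCBDAB", "BDCABA"))  # 4
```

Each distinct pair `(i, j)` is evaluated once, so the work is proportional to the number of table entries rather than to the number of recursion paths.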

LCS-LENGTH(X, Y)
    m = X.length
    n = Y.length
    let b[1..m, 1..n] and c[0..m, 0..n] be new tables
    for i = 1 to m
        c[i, 0] = 0
    for j = 0 to n
        c[0, j] = 0
    for i = 1 to m
        for j = 1 to n
            if x_i == y_j
                c[i, j] = c[i-1, j-1] + 1
                b[i, j] = "↖"
            elseif c[i-1, j] ≥ c[i, j-1]
                c[i, j] = c[i-1, j]
                b[i, j] = "↑"
            else
                c[i, j] = c[i, j-1]
                b[i, j] = "←"
    return c and b

Running time: $\Theta(mn)$.

Figure: the $c$ and $b$ tables computed by LCS-LENGTH on $X = \langle A, B, C, B, D, A, B \rangle$ and $Y = \langle B, D, C, A, B, A \rangle$.
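A possible Python rendering of LCS-LENGTH is sketched below. It is an assumption-laden transcription rather than the slides' own code: the arrows are stored as the strings `"diag"`, `"up"`, and `"left"`, and both tables are allocated with an unused row 0 and column 0 in `b` so that indices match the pseudocode.

```python
def lcs_length(x, y):
    """Bottom-up LCS tables: c holds lengths, b holds the choice made."""
    m, n = len(x), len(y)
    # Row 0 and column 0 of c stay 0 (empty prefix).
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # row-major order
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "diag"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "up"
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "left"
    return c, b


c, b = lcs_length("ABCBDAB", "BDCABA")
print(c[7][6])  # 4
print(c[7])     # [0, 1, 2, 2, 3, 4, 4]
```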

Step 4: Constructing an LCS

The $b$ table returned by LCS-LENGTH enables us to quickly construct an LCS of $X$ and $Y$. We simply begin at $b[m, n]$ and trace through the table by following the arrows. Whenever we encounter a "↖" in entry $b[i, j]$, it implies that $x_i = y_j$ is an element of the LCS that LCS-LENGTH found. With this method, we encounter the elements of this LCS in reverse order. A recursive procedure prints out an LCS of $X$ and $Y$ in the proper, forward order.

Figure: the square in row $i$ and column $j$ contains the value of $c[i, j]$ and the appropriate arrow for the value of $b[i, j]$. The entry 4 in $c[7, 6]$, the lower right-hand corner of the table, is the length of an LCS $\langle B, C, B, A \rangle$ of $X$ and $Y$. For $i, j > 0$, entry $c[i, j]$ depends only on whether $x_i = y_j$ and the values in entries $c[i-1, j]$, $c[i, j-1]$, and $c[i-1, j-1]$, which are computed before $c[i, j]$. To reconstruct the elements of an LCS, follow the $b[i, j]$ arrows from the lower right-hand corner. Each "↖" on the shaded sequence corresponds to an entry (highlighted) for which $x_i = y_j$ is a member of an LCS.
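The trace-back can be sketched as a recursive function that returns the LCS in forward order. The block repeats a compact table builder so that it runs on its own; the names `lcs_tables` and `build_lcs` and the string labels for the arrows are assumptions made for this illustration.

```python
def lcs_tables(x, y):
    """Build the c (length) and b (direction) tables bottom up."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "diag"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], "up"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "left"
    return c, b


def build_lcs(b, x, i, j):
    """Follow the arrows from b[i][j]; emit x_i after recursing on a diagonal."""
    if i == 0 or j == 0:
        return ""
    if b[i][j] == "diag":
        # Recurse first, then append, so symbols come out in forward order.
        return build_lcs(b, x, i - 1, j - 1) + x[i - 1]
    if b[i][j] == "up":
        return build_lcs(b, x, i - 1, j)
    return build_lcs(b, x, i, j - 1)


x, y = "ABCBDAB", "BDCABA"
_, b = lcs_tables(x, y)
print(build_lcs(b, x, len(x), len(y)))  # BCBA
```

Each recursive call decrements at least one of `i` and `j`, so the trace-back makes at most `m + n` calls.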

Improving the code

Each $c[i, j]$ entry depends on only three other $c$ table entries: $c[i-1, j]$, $c[i, j-1]$, and $c[i-1, j-1]$. Given the value of $c[i, j]$, we can determine in $O(1)$ time which of these three values was used to compute $c[i, j]$, without inspecting table $b$. We can therefore reconstruct an LCS in $O(m + n)$ time without the $b$ table. The auxiliary space requirement for computing an LCS does not asymptotically decrease, since we need $\Theta(mn)$ space for the $c$ table anyway.

We can, however, reduce the asymptotic space requirements for LCS-LENGTH, since it needs only two rows of table $c$ at a time: the row being computed and the previous row. This improvement works if we need only the length of an LCS; if we need to reconstruct the elements of an LCS, the smaller table does not keep enough information to retrace our steps in $O(m + n)$ time.
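Both improvements can be sketched briefly. The first function reconstructs an LCS from the $c$ table alone by re-deriving each decision; the second keeps only two rows and returns just the length. The function names are hypothetical, and the tie-breaking rule (prefer the entry above) is chosen to mirror LCS-LENGTH.

```python
def lcs_without_b(x, y):
    """Reconstruct an LCS using only the c table (no b table)."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    out = []
    i, j = m, n
    while i > 0 and j > 0:            # O(m + n) steps, O(1) work each
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def lcs_length_two_rows(x, y):
    """LCS length using only the previous row and the row being computed."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_without_b("ABCBDAB", "BDCABA"))        # BCBA
print(lcs_length_two_rows("ABCBDAB", "BDCABA"))  # 4
```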

15.5 Optimal binary search trees

Suppose we are designing a program to translate text. We perform lookup operations by building a BST with $n$ words as keys and their equivalents as satellite data. We can ensure an $O(\lg n)$ search time per occurrence by using a red-black tree or any other balanced BST. In such a tree, however, a frequently used word may appear far from the root while a rarely used word appears near the root. We want frequent words to be placed nearer the root. How do we organize a BST so as to minimize the number of nodes visited in all searches, given that we know how often each word occurs?

What we need is an optimal binary search tree. Formally, given a sequence $K = \langle k_1, k_2, \ldots, k_n \rangle$ of $n$ distinct sorted keys ($k_1 < k_2 < \cdots < k_n$), we wish to build a BST from these keys. For each key $k_i$, we have a probability $p_i$ that a search will be for $k_i$. Some searches may be for values not in $K$, so we also have $n + 1$ dummy keys $d_0, d_1, \ldots, d_n$ representing values not in $K$. In particular, $d_0$ represents all values less than $k_1$, and $d_n$ represents all values greater than $k_n$.

For $i = 1, 2, \ldots, n - 1$, the dummy key $d_i$ represents all values between $k_i$ and $k_{i+1}$. For each dummy key $d_i$, we have a probability $q_i$ that a search will correspond to $d_i$.

Each key $k_i$ is an internal node, and each dummy key $d_i$ is a leaf. Every search is either successful (finds a key $k_i$) or unsuccessful (finds a dummy key $d_i$), and so we have

$$\sum_{i=1}^{n} p_i + \sum_{i=0}^{n} q_i = 1.$$

Because we have probabilities of searches for each key and each dummy key, we can determine the expected cost of a search in a given BST $T$.

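Let us assume that the actual cost of a search equals the number of nodes examined, i.e., the depth of the node found by the search in $T$, plus 1. Then the expected cost of a search in $T$ is

$$\mathrm{E}[\text{search cost in } T] = \sum_{i=1}^{n} (\mathrm{depth}_T(k_i) + 1) \cdot p_i + \sum_{i=0}^{n} (\mathrm{depth}_T(d_i) + 1) \cdot q_i = 1 + \sum_{i=1}^{n} \mathrm{depth}_T(k_i) \cdot p_i + \sum_{i=0}^{n} \mathrm{depth}_T(d_i) \cdot q_i,$$

where $\mathrm{depth}_T$ denotes a node's depth in tree $T$.

(Table: for each node, its depth, probability, and contribution to the expected search cost, followed by the total.)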
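As a small worked illustration of the expected-cost formula, the sketch below evaluates it for a hypothetical two-key tree. The tree shape and the probabilities are made up for this example and do not come from the slides: $k_1$ is the root, $d_0$ is its left child, $k_2$ is its right child, and $d_1$, $d_2$ are the children of $k_2$.

```python
def expected_search_cost(key_depths, p, dummy_depths, q):
    """Sum of (depth + 1) * probability over all keys and dummy keys."""
    cost = 0.0
    for depth, prob in zip(key_depths, p):
        cost += (depth + 1) * prob      # successful searches
    for depth, prob in zip(dummy_depths, q):
        cost += (depth + 1) * prob      # unsuccessful searches
    return cost


# Hypothetical example (assumed values): depths of k1, k2 and of d0, d1, d2.
key_depths = [0, 1]
p = [0.3, 0.2]
dummy_depths = [1, 2, 2]
q = [0.2, 0.2, 0.1]                     # all probabilities sum to 1

# 1*0.3 + 2*0.2 + 2*0.2 + 3*0.2 + 3*0.1 = 2.0 (up to floating-point rounding)
print(expected_search_cost(key_depths, p, dummy_depths, q))
```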

For a given set of probabilities, we wish to construct a BST whose expected search cost is smallest. We call such a tree an optimal binary search tree. An optimal BST for the probabilities given has expected cost 2.75. An optimal BST is not necessarily a tree whose overall height is smallest. Nor can we necessarily construct an optimal BST by always putting the key with the greatest probability at the root. The lowest expected cost of any BST with $k_5$ at the root is 2.85.

Step 1: The structure of an optimal BST

Consider any subtree of a BST. It must contain keys in a contiguous range $k_i, \ldots, k_j$, for some $1 \le i \le j \le n$. In addition, a subtree that contains keys $k_i, \ldots, k_j$ must also have as its leaves the dummy keys $d_{i-1}, \ldots, d_j$. If an optimal BST $T$ has a subtree $T'$ containing keys $k_i, \ldots, k_j$, then this subtree $T'$ must be optimal as well for the subproblem with keys $k_i, \ldots, k_j$ and dummy keys $d_{i-1}, \ldots, d_j$.

Given keys $k_i, \ldots, k_j$, one of them, say $k_r$ ($i \le r \le j$), is the root of an optimal subtree containing these keys. The left subtree of the root $k_r$ contains the keys $k_i, \ldots, k_{r-1}$ (and dummy keys $d_{i-1}, \ldots, d_{r-1}$). The right subtree contains the keys $k_{r+1}, \ldots, k_j$ (and dummy keys $d_r, \ldots, d_j$). As long as we examine all candidate roots $k_r$, where $i \le r \le j$, and determine all optimal BSTs containing $k_i, \ldots, k_{r-1}$ and those containing $k_{r+1}, \ldots, k_j$, we are guaranteed to find an optimal BST.

Suppose that in a subtree with keys $k_i, \ldots, k_j$, we select $k_i$ as the root. Then $k_i$'s left subtree contains the keys $k_i, \ldots, k_{i-1}$. We interpret this sequence as containing no keys. Subtrees, however, also contain dummy keys. We adopt the convention that a subtree containing keys $k_i, \ldots, k_{i-1}$ has no actual keys but does contain the single dummy key $d_{i-1}$. Symmetrically, if we select $k_j$ as the root, then $k_j$'s right subtree contains no actual keys, but it does contain the dummy key $d_j$.

Step 2: A recursive solution

We pick our subproblem domain as finding an optimal BST containing the keys $k_i, \ldots, k_j$, where $i \ge 1$, $j \le n$, and $j \ge i - 1$. Let us define $e[i, j]$ as the expected cost of searching an optimal BST containing the keys $k_i, \ldots, k_j$. Ultimately, we wish to compute $e[1, n]$. The easy case occurs when $j = i - 1$. Then we have just the dummy key $d_{i-1}$, and the expected search cost is $e[i, i-1] = q_{i-1}$.

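When $j \ge i$, we need to select a root $k_r$ from among $k_i, \ldots, k_j$ and make an optimal BST with keys $k_i, \ldots, k_{r-1}$ as its left subtree and an optimal BST with keys $k_{r+1}, \ldots, k_j$ as its right subtree. What happens to the expected search cost of a subtree when it becomes a subtree of a node? The depth of each node in the subtree increases by 1, so the expected search cost of this subtree increases by the sum of all the probabilities in it. For a subtree with keys $k_i, \ldots, k_j$, let us denote this sum of probabilities as

$$w(i, j) = \sum_{l=i}^{j} p_l + \sum_{l=i-1}^{j} q_l.$$

Thus, if $k_r$ is the root of an optimal subtree containing keys $k_i, \ldots, k_j$, we have

$$e[i, j] = p_r + (e[i, r-1] + w(i, r-1)) + (e[r+1, j] + w(r+1, j)).$$

Noting that $w(i, j) = w(i, r-1) + p_r + w(r+1, j)$, we rewrite this as

$$e[i, j] = e[i, r-1] + e[r+1, j] + w(i, j).$$

We choose the root that gives the lowest expected search cost:

$$e[i, j] = \begin{cases} q_{i-1} & \text{if } j = i - 1, \\ \min_{i \le r \le j} \{ e[i, r-1] + e[r+1, j] + w(i, j) \} & \text{if } i \le j. \end{cases}$$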
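The recurrence for $e[i, j]$ can be evaluated directly with memoization. The sketch below is illustrative only: the function name is hypothetical, `p[0]` is an unused placeholder so that `p[i]` matches $p_i$, and the sample probabilities are assumed here to be the five-key example from the CLRS textbook (they are not listed in the slides), for which the optimal expected cost is 2.75.

```python
from functools import lru_cache


def optimal_bst_cost(p, q):
    """Expected search cost e[1, n] of an optimal BST, computed top down.

    p[1..n] are key probabilities (p[0] is an unused placeholder);
    q[0..n] are dummy-key probabilities.
    """
    n = len(p) - 1

    def w(i, j):
        # Sum of p_i..p_j plus q_{i-1}..q_j.
        return sum(p[i:j + 1]) + sum(q[i - 1:j + 1])

    @lru_cache(maxsize=None)
    def e(i, j):
        if j == i - 1:                  # only the dummy key d_{i-1}
            return q[i - 1]
        best = min(e(i, r - 1) + e(r + 1, j) for r in range(i, j + 1))
        return best + w(i, j)

    return e(1, n)


# Assumed sample data (CLRS textbook example, not given in the slides).
p = [0.0, 0.15, 0.10, 0.05, 0.10, 0.20]
q = [0.05, 0.10, 0.05, 0.05, 0.05, 0.10]
print(round(optimal_bst_cost(p, q), 2))  # 2.75
```

Recomputing `w(i, j)` with `sum` on every call is simple but wasteful; a tabulated $w$ avoids that.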

The $e[i, j]$ values give the expected search costs in optimal BSTs. To help us keep track of the structure of optimal BSTs, we define $root[i, j]$, for $1 \le i \le j \le n$, to be the index $r$ for which $k_r$ is the root of an optimal BST containing keys $k_i, \ldots, k_j$. Although we will see how to compute the values of $root[i, j]$, we leave the construction of an optimal binary search tree from these values as an exercise.

Step 3: Computing the expected search cost of an optimal BST

We store the $e[i, j]$ values in a table $e[1..n+1, 0..n]$. The first index needs to run to $n + 1$ because, to have a subtree containing only the dummy key $d_n$, we need to compute and store $e[n+1, n]$. The second index needs to start from 0 because, to have a subtree containing only the dummy key $d_0$, we need to compute and store $e[1, 0]$.

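We use only the entries $e[i, j]$ for which $j \ge i - 1$. We also use a table $root[i, j]$ for recording the root of the subtree containing keys $k_i, \ldots, k_j$. This table uses only the entries for which $1 \le i \le j \le n$. We also store the $w(i, j)$ values in a table $w[1..n+1, 0..n]$. For the base case, we compute $w[i, i-1] = q_{i-1}$ for $1 \le i \le n + 1$. For $j \ge i$, we compute $w[i, j] = w[i, j-1] + p_j + q_j$. Thus, we can compute the $\Theta(n^2)$ values of $w[i, j]$ in $\Theta(1)$ time each.

OPTIMAL-BST(p, q, n)
    let e[1..n+1, 0..n], w[1..n+1, 0..n], and root[1..n, 1..n] be new tables
    for i = 1 to n + 1
        e[i, i-1] = q_{i-1}
        w[i, i-1] = q_{i-1}
    for l = 1 to n
        for i = 1 to n - l + 1
            j = i + l - 1
            e[i, j] = ∞
            w[i, j] = w[i, j-1] + p_j + q_j
            for r = i to j
                t = e[i, r-1] + e[r+1, j] + w[i, j]
                if t < e[i, j]
                    e[i, j] = t
                    root[i, j] = r
    return e and root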
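One way to express OPTIMAL-BST in Python is sketched below. It is an illustration under stated assumptions: tables are oversized lists so that the 1-based indices of the pseudocode can be used unchanged, `p[0]` is an unused placeholder, and the sample input uses probabilities in hundredths (integers) to avoid floating-point ties. The sample values are assumed to be the CLRS textbook example and are not given in the slides.

```python
import math


def optimal_bst(p, q, n):
    """Return tables e (expected costs) and root (optimal root indices)."""
    # Rows 0..n+1 and columns 0..n; row 0 is unused padding.
    e = [[0] * (n + 1) for _ in range(n + 2)]
    w = [[0] * (n + 1) for _ in range(n + 2)]
    root = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 2):           # base case: only dummy key d_{i-1}
        e[i][i - 1] = q[i - 1]
        w[i][i - 1] = q[i - 1]
    for length in range(1, n + 1):      # number of keys in the subproblem
        for i in range(1, n - length + 2):
            j = i + length - 1
            e[i][j] = math.inf
            w[i][j] = w[i][j - 1] + p[j] + q[j]
            for r in range(i, j + 1):   # try each key k_r as the root
                t = e[i][r - 1] + e[r + 1][j] + w[i][j]
                if t < e[i][j]:
                    e[i][j] = t
                    root[i][j] = r
    return e, root


# Assumed sample data in hundredths (CLRS textbook example).
p = [0, 15, 10, 5, 10, 20]
q = [5, 10, 5, 5, 5, 10]
e, root = optimal_bst(p, q, 5)
print(e[1][5], root[1][5])  # 275 2
```

Because the comparison is strict (`t < e[i][j]`), ties between candidate roots are resolved in favor of the smallest index `r`.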

The OPTIMAL-BST procedure takes $\Theta(n^3)$ time, just like MATRIX-CHAIN-ORDER. Its running time is $O(n^3)$, since its for loops are nested three deep and each loop index takes on at most $n$ values. The loop indices in OPTIMAL-BST do not have exactly the same bounds as those in MATRIX-CHAIN-ORDER, but they are within 1 in all directions. Thus, like MATRIX-CHAIN-ORDER, the OPTIMAL-BST procedure takes $\Omega(n^3)$ time.
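The cubic growth can be checked by counting how often the innermost loop body runs. The sketch below mirrors only the loop bounds of OPTIMAL-BST; the function name is hypothetical. Summing $l \cdot (n - l + 1)$ over $l = 1, \ldots, n$ gives $n(n+1)(n+2)/6$, which is $\Theta(n^3)$.

```python
def inner_iterations(n):
    """Count executions of the innermost loop body for n keys."""
    count = 0
    for length in range(1, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for r in range(i, j + 1):   # 'length' candidate roots
                count += 1
    return count


print(inner_iterations(5))  # 35, which equals 5 * 6 * 7 / 6
```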
