In this chapter we will very briefly review the most common algorithmic techniques which are used in bioinformatics.


Mavis Hines
5 years ago

Algorithm Techniques
In this chapter we will very briefly review the most common algorithmic techniques which are used in bioinformatics.
Bioinfo I (Institut Pasteur de Montevideo) Algorithm Techniques -class1- July 3rd

Algorithms
An algorithm is a well-defined, finite sequence of steps used to solve a well-defined problem.
An algorithm that solves every instance of the problem for which it was designed is said to be correct.
The running time of an algorithm is the number of machine instructions it executes when run on a particular instance.
For the analysis of an algorithm, the running time is computed for the worst-case instance of the problem.

Running time
A computer needs a fixed amount of time $t_{op}$ (in seconds) to execute one operation.
An algorithm needs a fixed number of steps $s$.
If $t_{op}$ and $s$ are known, the running time of the algorithm is $t_{op} \cdot s$.
Because $t_{op}$ changes constantly with the hardware, we base the analysis on $s$, which is independent of hardware.
$s$ is not always easy to determine: it depends on the input size $n$.

Big-O Notation
Big-O notation describes the running time of an algorithm.
$O(n^2)$: the running time of the algorithm is bounded by a second-degree polynomial.
$f(n) = O(n^2)$: $f$ does not grow faster than $c \cdot n^2$ for some constant $c$.
$2n = O(n^2)$ is valid but uninformative; $2n = O(n)$ is more informative.
Big-O establishes an upper bound for the growth of a function: if $f(n) = O(g(n))$, then $f$ does not grow faster than $g$.

Definitions
Let $f$ and $g$ be real functions.
1. One writes $f(x) = O(g(x))$ if and only if there exist $c$ and $x_0$ ($c, x_0 \in \mathbb{R}$, $c > 0$) such that $f(x) \le c \cdot g(x)$ for all $x \ge x_0$.
2. One writes $f(x) = \Omega(g(x))$ if and only if there exist $c$ and $x_0$ ($c, x_0 \in \mathbb{R}$, $c > 0$) such that $f(x) \ge c \cdot g(x)$ for all $x \ge x_0$.
3. One writes $f(x) = \Theta(g(x))$ if and only if $f(x) = O(g(x))$ and $f(x) = \Omega(g(x))$.
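As a small worked illustration of these definitions (the constants chosen here are just one possible choice, not taken from the slides), consider $f(n) = 2n$:

```latex
\begin{align*}
2n &\le 1 \cdot n^2 \quad \text{for all } n \ge 2
   &&\Rightarrow\; 2n = O(n^2) \text{ with } c = 1,\ x_0 = 2,\\
2n &\le 2 \cdot n \quad \text{for all } n \ge 1
   &&\Rightarrow\; 2n = O(n) \text{ with } c = 2,\ x_0 = 1,\\
2n &\ge 2 \cdot n \quad \text{for all } n \ge 1
   &&\Rightarrow\; 2n = \Omega(n) \text{ with } c = 2,\ x_0 = 1,
\end{align*}
and therefore $2n = \Theta(n)$.
```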

Example: Sorting Algorithms
Sorting Problem: sort a list of integers.
Input: a list of $n$ distinct integers $a = (a_1, a_2, \dots, a_n)$.
Output: the sorted list of integers, that is, a reordering $b = (b_1, b_2, \dots, b_n)$ of the integers from $a$ such that $b_1 < b_2 < \dots < b_n$.

Selection Sort Algorithm:
SELECTIONSORT(a, n)
    for i ← 1 to n − 1
        a_j ← smallest element among a_i, a_{i+1}, ..., a_n
        swap a_i and a_j
    return a

Example: Sorting Algorithms
Recursive Selection Sort:
RECURSIVESELECTIONSORT(a, first, last)
    if first < last
        index ← INDEXOFMIN(a, first, last)
        swap a_first with a_index
        a ← RECURSIVESELECTIONSORT(a, first + 1, last)
    return a

INDEXOFMIN(array, first, last)
    index ← first
    for k ← first + 1 to last
        if array_k < array_index
            index ← k
    return index

Complexity Analysis
SELECTIONSORT performs $n - 1$ iterations and analyzes $n - i + 1$ elements in iteration $i$.
The approximate number of operations is $n + (n-1) + (n-2) + \dots + 2 + 1 = \sum_{k=1}^{n} k = \frac{n(n+1)}{2}$.
Each iteration also performs a swap: 3 operations.
Total: $\frac{n(n+1)}{2} + 3(n-1) \in O(n^2)$.

SELECTIONSORT(a, n)
    for i ← 1 to n − 1
        j ← INDEXOFMIN(a, i, n)
        swap elements a_i and a_j
    return a

INDEXOFMIN(array, first, last)
    index ← first
    for k ← first + 1 to last
        if array_k < array_index
            index ← k
    return index
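The pseudocode above can be sketched in Python as follows. This is only an illustrative translation: it uses 0-based indices instead of the 1-based indices of the slides, and the comparison counter is an addition made here to make the quadratic growth visible.

```python
def index_of_min(a, first, last):
    """Return (index of the smallest element in a[first..last], comparisons made)."""
    index = first
    comparisons = 0
    for k in range(first + 1, last + 1):
        comparisons += 1
        if a[k] < a[index]:
            index = k
    return index, comparisons


def selection_sort(a):
    """Return (sorted copy of a, total number of element comparisons)."""
    a = list(a)  # work on a copy so the caller's list is untouched
    n = len(a)
    total = 0
    for i in range(n - 1):
        j, comparisons = index_of_min(a, i, n - 1)
        total += comparisons
        a[i], a[j] = a[j], a[i]  # swap a_i and a_j
    return a, total


if __name__ == "__main__":
    print(selection_sort([5, 2, 9, 1]))
```

For a list of length n the counter always ends at n(n − 1)/2, regardless of the input order, which matches the O(n^2) bound.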

Complexity Analysis
Let $T(n)$ be the time the recursive selection sort needs for an input of size $n$.
Finding the smallest element takes at most $n$ operations.
The recursive call on an array of size $n - 1$ takes $T(n-1)$.
The call on an array of size 1 takes constant time.
It holds that $T(n) = n + T(n-1)$ and $T(1) = 1$, so
$T(n) = n + (n-1) + T(n-2) = n + (n-1) + (n-2) + \dots + T(1) \in O(n^2)$.

Algorithms
Conceptually we distinguish:
Algorithm strategy
Algorithm structure: recursive or iterative
Algorithm solution: find a good solution, or find the best solution(s)

Algorithm strategies
Brute force algorithms
Greedy algorithms
Recursive algorithms
Backtracking algorithms
Branch and bound algorithms
Divide and conquer algorithms
Dynamic programming algorithms
Heuristic algorithms

Brute Force or Exhaustive Search
Systematically enumerate all possible candidates for the solution and check whether each candidate satisfies the problem's statement.
Simple.
Very slow.
Used as a starting point for other types of algorithms.

Greedy
Many algorithms are iterative processes.
Greedy algorithms choose the most attractive option in each iteration.

Recursive
A combinatorial problem: Fibonacci numbers.
The problem of the Fibonacci numbers is a classical example of a recursion problem:
$F_0 = 0$, $F_1 = 1$, $F_n = F_{n-1} + F_{n-2}$

Recursions
Recursion: reapply the algorithm to a subproblem.
Another example: N!, the factorial of a number N:
function fact(N) {
    if (N <= 1) return 1;          // base case, also covers 0! = 1
    else return N * fact(N - 1);   // recursive case
}

Backtracking
Backtracking is a general technique for organizing the exhaustive search for a solution to a combinatorial problem.
The backtracking technique can be applied to those problems that exhibit the domino principle: if a constraint (condition) is not satisfied by a partial solution, the constraint will not be satisfied by any extension of the partial solution to a global solution.

Backtracking
Domino principle: given h (height of a domino) > w (space between dominoes), we knock over the first domino; if the nth domino falls, then the (n + 1)st domino will fall.

Backtracking
The backtracking algorithm enumerates a set of partial candidates that could be completed in various ways to give all the possible solutions to the given problem.
The solution is built incrementally, by a sequence of candidate extension steps.
Conceptually, the partial candidates are the nodes of a tree, the search tree.
Each partial candidate is the parent of the candidates that differ from it by a single extension step.
Leaves of the tree are the partial candidates that cannot be further extended.
The backtracking algorithm traverses this search tree recursively, from the root down, in depth-first order.

Backtracking
At each node c, the algorithm checks whether c can be completed to a valid solution.
If it cannot, the whole sub-tree rooted at c is skipped (pruned).
Otherwise, the algorithm (a) checks whether c itself is a valid solution and (b) recursively enumerates all sub-trees of c.
The actual search tree that is traversed by the algorithm is only a part of the tree.
The total cost of the algorithm is the number of nodes of the actual tree times the cost of obtaining and processing each node.

Backtracking
Example: Eight queens puzzle
How can 8 queens be placed on a chessboard so that no two queens attack each other?
Consider one row of the board at a time.
Eliminate most non-solution board positions at a very early stage.
The algorithm rejects attacks on incomplete boards, hence it examines only a fraction of the possible queen placements (brute force: $64^8 = 281{,}474{,}976{,}710{,}656$).
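A minimal backtracking sketch for the eight queens puzzle is given below. It follows the row-by-row idea from the slide; the function and variable names are illustrative choices made here and do not come from the lecture.

```python
def solve_queens(n=8):
    """Return all placements of n non-attacking queens as tuples of columns, one per row."""
    solutions = []
    cols = []  # cols[r] is the column of the queen already placed in row r

    def safe(col):
        # A queen in the next row is safe if it shares no column and no diagonal
        # with any queen placed so far (rows are distinct by construction).
        row = len(cols)
        return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))

    def place():
        if len(cols) == n:  # every row has a queen: a complete solution
            solutions.append(tuple(cols))
            return
        for col in range(n):
            if safe(col):  # otherwise the whole sub-tree is pruned
                cols.append(col)
                place()
                cols.pop()  # backtrack

    place()
    return solutions


if __name__ == "__main__":
    print(len(solve_queens(8)))
```

Because attacked squares are rejected as soon as a partial board violates a constraint, no extension of an invalid partial board is ever generated.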

Branch-and-Bound
The branch-and-bound method can be used for finding one or all solutions of a combinatorial problem in which solutions are associated with a cost, such that the cost of the whole solution cannot be smaller than the cost of any partial solution (optimization problems).
The technique consists of remembering the lowest-cost solution found at each stage of the backtracking search, and using the cost of the lowest-cost solution found so far as an upper bound on the cost of a least-cost solution to the problem, in order to discard partial solutions whose cost is already larger than that of the lowest-cost solution found so far.

Branch-and-Bound
Represent the search again as a tree: the root of the bb-tree is a so-called dummy node of cost zero, the nodes at level one represent the possible values which the first variable can be assigned, the nodes at level two represent the possible values which the second variable can be assigned given the value of the first variable, and so on.
Subtrees rooted at nodes whose cost is greater than the cost of a previously found leaf node are pruned off the bb-tree.
[Figure: example bb-tree]
Problem: the tree can become exponentially large.

Divide-and-Conquer
Definition: an algorithmic technique. To solve a problem on an instance of size n, a solution is found either directly because solving that instance is easy (typically, because the instance is small) or the instance is divided into two or more smaller instances. Each of these smaller instances is recursively solved, and the solutions are combined to produce a solution for the original instance.

Divide-and-Conquer
Methodology:
1. Given a problem, identify a small number of significantly smaller subproblems of the same type.
2. Solve each subproblem recursively (the smallest possible size of a subproblem is a base case).
3. Combine these solutions into a solution for the main problem.
The name divide and conquer reflects that the problem is conquered by dividing it into several smaller problems.

Divide-and-Conquer
The divide-and-conquer technique can be applied to those problems that exhibit the independence principle: a problem instance can be divided into a series of smaller problem instances which are independent of each other.
Example: one of the simplest examples is Quicksort of an array: partition the array into two parts, and quicksort each of the parts. Here, in fact, no additional work is required to combine the two sorted parts.
Running time: $O(n^2)$ in the worst case ($O(n \log n)$ on average).
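The Quicksort idea can be sketched in a few lines of Python. This is a simplified, non-in-place variant chosen here for readability; it takes the first element as the pivot, which is one of several common pivot rules and is an assumption of this sketch.

```python
def quicksort(a):
    """Return a sorted copy of a using a simple (not in-place) Quicksort."""
    if len(a) <= 1:  # base case: nothing to divide
        return list(a)
    pivot, rest = a[0], a[1:]
    smaller = [x for x in rest if x < pivot]   # divide: elements below the pivot
    larger = [x for x in rest if x >= pivot]   # divide: elements at or above the pivot
    # conquer each part recursively; combining is plain concatenation
    return quicksort(smaller) + [pivot] + quicksort(larger)


if __name__ == "__main__":
    print(quicksort([3, 5, 5, 9, 1]))
```

With this pivot rule an already sorted input produces maximally unbalanced partitions, which is the case behind the quadratic worst-case bound.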

Divide-and-Conquer
When a problem is solved by divide-and-conquer, sometimes the same subproblem appears multiple times.
A recursive algorithm that follows the definition of the Fibonacci numbers directly is:
Fibonacci-R(i)
    if i = 0 then return 0
    else if i = 1 then return 1
    else return Fibonacci-R(i − 1) + Fibonacci-R(i − 2)

Divide-and-Conquer
However, it is easy to see that the algorithm is not efficient, since values of $F_i$ are calculated several times independently.
[Figure: recursion tree of Fibonacci-R(n), in which the subproblems n − 2, n − 3, n − 4, ... appear repeatedly]

Randomized Algorithms
Toss a coin to decide where to start looking for the phone.
Not as intuitive as deterministic algorithms.

Machine Learning
Collect statistics over the course of a year about where you leave the phone, learning where the phone tends to end up most of the time.
E.g. 80% of the time it was left in the bathroom, 15% in the bedroom and 5% in the kitchen.
Strategy: first look in the bathroom, then in the bedroom and finally in the kitchen.

Dynamic Programming
Dynamic programming is a very general programming technique.
It is most often applied in the construction of algorithms to solve a certain class of optimisation problems, i.e. problems which require the minimisation or maximisation of some measure.
It is applicable when a large search space can be structured into a succession of stages, such that
the initial stage contains trivial solutions to sub-problems,
each partial solution in a later stage can be calculated by recurring on only a fixed number of partial solutions in an earlier stage,
the final stage contains the overall solution.
The method usually accomplishes this by maintaining a table or matrix of sub-instance results.

Dynamic Programming
Dynamic programming can be thought of as being the reverse of recursion or divide-and-conquer.
Divide-and-conquer is a top-down mechanism: we take a problem, split it up, and solve the smaller problems that are created.
Dynamic programming is a bottom-up mechanism: we solve all possible small problems and then combine them to obtain solutions for bigger problems.
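To make the bottom-up idea concrete, here is a small Python sketch that computes Fibonacci numbers by filling a table from the trivial cases upwards. The function name is chosen here for illustration only.

```python
def fibonacci_dp(n):
    """Compute F_n bottom-up, storing every sub-result in a table."""
    table = [0, 1]  # initial stage: the trivial solutions F_0 and F_1
    for i in range(2, n + 1):
        # each later entry depends on a fixed number (two) of earlier entries
        table.append(table[i - 1] + table[i - 2])
    return table[n]


if __name__ == "__main__":
    print([fibonacci_dp(i) for i in range(11)])
```

Each value is computed exactly once, so the work grows linearly in n, in contrast to the recursive version that recomputes the same subproblems many times.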

Dynamic Programming
A general DP algorithm consists of 4 steps:
1. Characterization of the structure of the (an) optimal solution
2. Recursive definition of the value of an optimal solution
3. Computation of the optimum using recursion
4. Construction of an optimal solution through the computed optimal value

Dynamic Programming
Example: The Rocks game
2 players, 2 piles of rocks, say 10 each.
In each turn one player may take either one rock (from either pile) or two rocks (one from each pile). Taken rocks are removed from the game.
The player that takes the last rock wins the game.
To find the winning strategy we construct a table R:
If Player 1 can always win the game (i, j), then we say $R_{ij} = W$.
If Player 1 loses the game (i, j), then we say $R_{ij} = L$.
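The table R can be filled bottom-up. The sketch below assumes the usual reading of the rules above: a position is winning for the player about to move if at least one legal move leads to a position that is losing for the opponent, and the empty position (0, 0) is losing because the previous player took the last rock. Names are illustrative.

```python
def rocks_table(n, m):
    """Return R where R[i][j] is 'W' or 'L' for the player to move with piles of i and j rocks."""
    R = [["L"] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            reachable = []  # outcomes (for the opponent) of every legal move
            if i > 0:
                reachable.append(R[i - 1][j])      # take one rock from the first pile
            if j > 0:
                reachable.append(R[i][j - 1])      # take one rock from the second pile
            if i > 0 and j > 0:
                reachable.append(R[i - 1][j - 1])  # take one rock from each pile
            # winning if some move leaves the opponent in a losing position;
            # (0, 0) has no moves and therefore stays 'L'
            R[i][j] = "W" if "L" in reachable else "L"
    return R


if __name__ == "__main__":
    for row in rocks_table(10, 10):
        print(" ".join(row))
```

Each entry depends only on three previously computed entries, so the whole table for piles of sizes n and m is filled in O(n · m) time.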


Tractable vs. Non Tractable Problems
Algorithms can be classified according to their complexity.
Problems can also be classified according to their inherent complexity.
There are problems for which there is no polynomial algorithm, e.g. enumerating all subsets of n elements.
Other problems can be solved in polynomial time.
Between these two classes, exponential and polynomial problems, lie the NP-complete problems.

Tractable vs. Non Tractable Problems
NP-complete problems are problems for which there is no known polynomial algorithm, but for which it has not been proven that such an algorithm does not exist.
The classic example: the Traveling Salesman Problem.

Literature
Sources and further recommended reading:
Schöning, Algorithmik, Spektrum Akademischer Verlag
Kay Nieselt's Lecture Notes (Grundlagen der Bioinformatik, SS 2007), Eberhard Karls Universität Tübingen
N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, 2004

DNA Mapping, Motifs and Brute Force Algorithms
In this chapter we will see:
Restriction Enzymes
Gel Electrophoresis
Partial Digest Problem
Brute Force Algorithm for Partial Digest Problem
Branch and Bound Algorithm for Partial Digest Problem
Double Digest Problem
Finding Regulatory Motifs

Molecular Scissors
[Figure. Source: Molecular Cell Biology, 4th edition]
[Figure. Source: eplantscience.com, An online botanical encyclopedia, Chapter 3]

Uses of restriction enzymes
Recombinant DNA technology: recombinant technology starts with the isolation of a gene of interest, which is then inserted into a vector and cloned. Recombinant proteins result from the expression of rDNA.
DNA cloning: a technique to reproduce DNA fragments, either cell-based or via PCR.
cDNA/genomic library construction: mRNA, cDNA, restriction enzyme + ligase, into plasmid; genomic regions.
DNA mapping

Restriction maps
A restriction map is a map showing the positions of restriction sites in a DNA sequence.
If the DNA sequence is known, then constructing the restriction map is a trivial exercise.
In the early days of molecular biology, DNA sequences were often unknown.
Biologists had to solve the problem of constructing restriction maps without knowing the DNA sequences.

Full Restriction Digest
Cutting DNA at each restriction site creates multiple restriction fragments.
Is it possible to reconstruct the order of the fragments from the sizes of the fragments {3, 5, 5, 9}?

Full Restriction Digest: Multiple Solutions
[Figure: alternative fragment orderings that yield the same fragment sizes]

Measuring length of fragments: Gel electrophoresis
Gel electrophoresis is a process for separating DNA by size and measuring the sizes of restriction fragments.
It separates DNA fragments that differ in only 1 nucleotide, for fragments up to 500 nucleotides long.
Using an electric field, molecules can be made to move through a gel (agar).

Measuring length of fragments: Gel electrophoresis
The gel is placed in an electrophoresis chamber.
When the electric current is applied, the larger molecules move more slowly through the gel while the smaller molecules move faster.
The different sized molecules form bands on the gel.

Detecting DNA
One possibility to visualize DNA bands is fluorescence:
The gel is incubated with a solution containing the fluorescent dye ethidium.
Ethidium binds to the DNA.
The DNA lights up when the gel is exposed to ultraviolet light.

Partial Restriction Digest
The sample of DNA is exposed to the restriction enzyme for only a limited amount of time, to prevent it from being cut at all restriction sites.
This experiment generates the set of all possible restriction fragments between every two (not necessarily consecutive) cuts.
This set of fragment sizes is used to determine the positions of the restriction sites in the DNA sequence.

Partial Restriction Digest: Example
The partial digest results in the following 10 restriction fragments:
Multiset: {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}
We assume that the multiplicity of a fragment can be detected, i.e., the number of restriction fragments of the same length can be determined (e.g., by observing twice as much fluorescence intensity for a double fragment as for a single fragment).

Partial Digest Fundamentals
X: the set of n integers representing the locations of all cuts in the restriction map, including the start and end
n: the total number of cuts
DX: the multiset of integers representing the lengths of each of the $\binom{n}{2}$ fragments produced from a partial digest

Partial Digest
A way of representing n, X and DX: DX = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10} can be written as a two-dimensional table, with the elements of X = {0, 2, 4, 7, 10} along both the top and the left side. The element at (i, j) in the table is $x_j - x_i$ for $1 \le i < j \le n$.

        0    2    4    7   10
   0         2    4    7   10
   2              2    5    8
   4                   3    6
   7                        3
  10

Partial Digest Problem Formulation
Goal: given all pairwise distances between points on a line, reconstruct the positions of those points.
Input: the multiset of pairwise distances L, containing $\frac{n(n-1)}{2}$ integers.
Output: a set X of n integers such that DX = L.
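Computing DX from a given X is the easy direction of the problem. A minimal Python sketch, with a function name chosen here for illustration, returns the multiset as a sorted list so that two digests can be compared directly:

```python
from itertools import combinations


def partial_digest_multiset(X):
    """Return DX, the multiset of pairwise distances of X, as a sorted list."""
    # combinations over the sorted points yields each pair (a, b) with a < b once
    return sorted(b - a for a, b in combinations(sorted(X), 2))


if __name__ == "__main__":
    print(partial_digest_multiset({0, 2, 4, 7, 10}))
```

For n points the result has n(n − 1)/2 entries, in agreement with the size of L in the problem formulation.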

Partial Digest Problem: Multiple Solutions
It is not always possible to uniquely reconstruct a set X based only on DX.
For example, the sets X = {0, 2, 5} and X + 10 = {10, 12, 15} both produce DX = {2, 3, 5} as their partial digest set.
The sets {0, 1, 2, 5, 7, 9, 12} and {0, 1, 5, 7, 8, 10, 12} present a less trivial example of non-uniqueness. They both digest into:
{1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, 8, 9, 10, 11, 12}

Homometric Sets
Two different sets that produce the same multiset of pairwise distances are called homometric.

Brute Force Algorithms
Also known as exhaustive search algorithms; they examine every possible variant to find a solution.
Efficient in rare cases; usually impractical.

Partial Digest Problem: Brute Force
1. Find the restriction fragment of maximum length M. M is the length of the DNA sequence.
2. For every possible set X = {0, x_2, ..., x_{n−1}, M}, compute the corresponding DX.
3. If DX is equal to the experimental partial digest L, then X is the correct restriction map.

BruteForcePDP(L, n):
    M ← maximum element in L
    for every set of n − 2 integers 0 < x_2 < ... < x_{n−1} < M
        X ← {0, x_2, ..., x_{n−1}, M}
        form DX from X
        if DX = L
            return X
    output "no solution"

Efficiency of Brute Force
BruteForcePDP takes $O(M^{n-2})$ time since it must examine all possible sets of positions.
One way to improve the algorithm is to limit the values of $x_i$ to only those values which occur in L.

AnotherBruteForcePDP(L, n):
    M ← maximum element in L
    for every set of n − 2 integers 0 < x_2 < ... < x_{n−1} < M from L
        X ← {0, x_2, ..., x_{n−1}, M}
        form DX from X
        if DX = L
            return X
    output "no solution"

Efficiency of AnotherBruteForcePDP
It is more efficient, but still slow.
This algorithm examines $\binom{|L|}{n-2}$ different sets.
If L = {2, 998, 1000} (n = 3, M = 1000), BruteForcePDP will be extremely slow, but AnotherBruteForcePDP will be quite fast.
Fewer sets are examined, but the runtime is still exponential: $O(n^{2n-4})$.
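A Python sketch of AnotherBruteForcePDP is shown below. It is an illustrative translation of the pseudocode, not a reference implementation: the helper `delta` and the use of `collections.Counter` to represent multisets are choices made here.

```python
from collections import Counter
from itertools import combinations


def delta(X):
    """Multiset DX of all pairwise distances between the points in X."""
    return Counter(b - a for a, b in combinations(sorted(X), 2))


def another_brute_force_pdp(L, n):
    """Return one set X (as a sorted list) with DX = L, or None if there is none."""
    target = Counter(L)
    M = max(L)  # the largest fragment is the full length of the DNA
    # candidate positions are restricted to the distinct values occurring in L
    candidates = sorted({x for x in L if x < M})
    for middle in combinations(candidates, n - 2):
        X = [0, *middle, M]
        if delta(X) == target:
            return X
    return None


if __name__ == "__main__":
    print(another_brute_force_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10], 5))
    print(another_brute_force_pdp([2, 998, 1000], 3))
```

For L = {2, 998, 1000} only two candidate positions are tried, whereas enumerating every integer below M = 1000 would require 999 attempts.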

Branch and Bound Algorithm for PDP
1. Begin with X = {0}.
2. Remove the largest element in L and place it in X.
3. See if the element fits on the right or left side of the restriction map.
4. When it fits, find the other lengths it creates and remove those from L.
5. Go back to step 2 until L is empty.

PartialDigest Algorithm
Before describing PartialDigest, first define D(y, X) as the multiset of all distances between the point y and all other points in the set X: for $X = \{x_1, x_2, \dots, x_n\}$,
$D(y, X) = \{|y - x_1|, |y - x_2|, \dots, |y - x_n|\}$

PartialDigest(L):
    width ← maximum element in L
    DELETE(width, L)
    X ← {0, width}
    PLACE(L, X)

PartialDigest Algorithm
PLACE(L, X)
    if L is empty
        output X
        return
    y ← maximum element in L
    if D(y, X) ⊆ L
        add y to X and remove lengths D(y, X) from L
        PLACE(L, X)
        remove y from X and add lengths D(y, X) to L
    if D(width − y, X) ⊆ L
        add width − y to X and remove lengths D(width − y, X) from L
        PLACE(L, X)
        remove width − y from X and add lengths D(width − y, X) to L
    return
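The PartialDigest and PLACE procedures can be sketched in Python as follows. This is a hedged illustration: multisets are represented with `collections.Counter`, solutions are collected in a list rather than printed, and the helper names are choices made here rather than part of the lecture.

```python
from collections import Counter


def distances(y, X):
    """D(y, X): multiset of distances between y and every point in X."""
    return Counter(abs(y - x) for x in X)


def partial_digest(L):
    """Return every X (as a sorted list) whose pairwise distances equal the multiset L."""
    remaining = Counter(L)
    width = max(remaining)
    remaining[width] -= 1  # DELETE(width, L)
    X = [0, width]
    solutions = []

    def is_submultiset(d):
        return all(remaining[length] >= count for length, count in d.items())

    def place():
        if sum(remaining.values()) == 0:  # L is empty: X is a solution
            solutions.append(sorted(X))
            return
        y = max(length for length, count in remaining.items() if count > 0)
        # try y on the left side and width - y on the right side (once if they coincide)
        for candidate in dict.fromkeys((y, width - y)):
            d = distances(candidate, X)
            if is_submultiset(d):  # bound: skip branches that cannot fit
                X.append(candidate)
                remaining.subtract(d)
                place()
                remaining.update(d)  # undo the move (backtrack)
                X.pop()

    place()
    return solutions


if __name__ == "__main__":
    print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
```

Because this sketch explores both symmetric choices at the first step, it reports each restriction map together with its mirror image.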

An Example
L = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10}, X = {0}

Remove 10 from L and insert it into X. We know this must be the length of the DNA sequence because it is the largest fragment.
L = {2, 2, 3, 3, 4, 5, 6, 7, 8}, X = {0, 10}

Take 8 from L and make y = 2 or y = 8. Since the two cases are symmetric, we can assume y = 2. The distances from y = 2 to the other elements in X are D(y, X) = {8, 2}, so we remove {8, 2} from L and add 2 to X.
L = {2, 3, 3, 4, 5, 6, 7}, X = {0, 2, 10}

Take 7 from L and make y = 7 or y = 10 - 7 = 3. We explore y = 7 first: D(y, X) = {|7 - 0|, |7 - 2|, |7 - 10|} = {7, 5, 3}. This is a subset of L, so we remove {7, 5, 3} from L and add 7 to X.
L = {2, 3, 4, 6}, X = {0, 2, 7, 10}

Take 6 from L and make y = 6. Unfortunately D(y, X) = {6, 4, 1, 4}, which is not a subset of L, so we will not explore this branch.
This time make y = 10 - 6 = 4. D(y, X) = {4, 2, 3, 6}, which is a subset of L, so we explore this branch. We remove {4, 2, 3, 6} from L and add 4 to X.
L = {}, X = {0, 2, 4, 7, 10}

L is now empty, so we have a solution, which is X.

To find other solutions, we backtrack.
L = {2, 3, 4, 6}, X = {0, 2, 7, 10}

We backtrack further.
L = {2, 3, 3, 4, 5, 6, 7}, X = {0, 2, 10}

This time we explore y = 3. D(y, X) = {3, 1, 7}, which is not a subset of L, so we will not explore this branch.

We have backtracked to the root. Therefore we have found all the solutions (up to the mirror image).
L = {2, 2, 3, 3, 4, 5, 6, 7, 8}, X = {0, 10}

Complexity analysis of PartialDigest
Still exponential in the worst case, but very fast on average.
Informally, let $T(n)$ be the time PartialDigest takes to place n cuts:
No branching case: $T(n) = T(n-1) + O(n)$, which is quadratic.
Branching case: $T(n) = 2T(n-1) + O(n) = 2(2T(n-2) + O(n)) + O(n) = \dots$, which is exponential.

Double Digest Problem (DDP)
Double Digest is yet another experimental method to construct restriction maps.
It uses two restriction enzymes and three full digests:
One with only the first enzyme
One with only the second enzyme
One with both enzymes
Computationally, the Double Digest problem is more complex than the Partial Digest problem.

Double Digest Problem (DDP)
Without the information about X (i.e. A + B), it is impossible to solve the DDP.

Double Digest Problem Formulation
Input:
dA: fragment lengths from the digest with enzyme A
dB: fragment lengths from the digest with enzyme B
dX: fragment lengths from the digest with both A and B
Output:
A: location of the cuts in the restriction map for enzyme A
B: location of the cuts in the restriction map for enzyme B
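Checking a proposed answer to the DDP is straightforward even though finding one is hard. The following Python sketch verifies a candidate pair of maps against the three digests; it assumes that cut positions are given as integers strictly inside a molecule of known total length, and all names are hypothetical.

```python
def fragment_lengths(cuts, length):
    """Sorted fragment lengths after a full digest at the given cut positions."""
    points = [0] + sorted(cuts) + [length]
    return sorted(b - a for a, b in zip(points, points[1:]))


def is_ddp_solution(A, B, length, dA, dB, dX):
    """Check whether cut sets A and B reproduce the digests dA, dB and dX."""
    return (
        fragment_lengths(A, length) == sorted(dA)
        and fragment_lengths(B, length) == sorted(dB)
        # the double digest cuts at every site of either enzyme
        and fragment_lengths(set(A) | set(B), length) == sorted(dX)
    )


if __name__ == "__main__":
    print(is_ddp_solution([3, 7], [5], 10, dA=[3, 3, 4], dB=[5, 5], dX=[2, 2, 3, 3]))
```

In the usage line, the map A = {3, 7}, B = {5} matches all three digests, while A = {4, 7} matches dA and dB but fails on dX, which shows why the double digest is needed.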

DDP: Multiple Solutions
[Figure: several restriction maps consistent with the same double digest data]
