Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics Elhanan Borenstein

Sequence omparison: Dynamic Programming Genome 373 Genomic Informatics Elhanan Borenstein

quick review: hallenges Find the best global alignment of two sequences Find the best global alignment of multiple sequences Find the best local (partial) alignment of two sequences Find the best match (alignment) of a given sequence in a longer dataset of sequences

quick review: Global lignment Global lignment Mission: Find the best global alignment between two sequences. e.g., Find the best alignment of GT and T: GT T GT- -T -GT- --T GT- -T GT- -T GT- -T G-T T- GT- -T orrect alignment vs. best alignment

quick review: Global lignment Global lignment Mission: Find the best global alignment between two sequences. n algorithm for finding the alignment with the best score method for scoring alignments

quick review: Scoring aligned bases Substitution matrix: Gap penalty: G T Linear gap penalty 10-5 0-5 ffine gap penalty -5 10-5 0 G 0-5 10-5 T -5 0-5 10 GT- -T d=-4-5 + 10 + -4 + 10 + -4 + 10 = 17

quick review: Global lignment Global lignment Mission: Find the best global alignment between two sequences. n algorithm for finding the alignment with the best score method for scoring alignments?

Exhaustive search lign the two sequences: GT and T GT T GT- -T -GT- --T GT- -T GT- -T GT- -T G-T T- GT- -T Simple (exhaustive search) algorithm 1) onstruct all possible alignments 2) Use the substitution matrix and gap penalty to score each alignment 3) Pick the alignment with the best score

omputational omplexity & the Big O Notation

Mission: Find the best global alignment of two sequences. search algorithm for finding the alignment with the best score method for scoring alignments? 1. More efficient search 2. recipe

Mission: Find the best global alignment of two sequences. search algorithm for finding the alignment with the best score method for scoring alignments? The Needleman Wunsch lgorithm

The Needleman Wunsch lgorithm n algorithm for global alignment on two sequences Dynamic Programming (DP) approach Yes, it s a weird name. DP is closely related to recursion and to mathematical induction We can prove that the resulting score is optimal.

DP matrix j 0 1 2 3 etc. i 0 1 2 3 4 5 T

DP matrix j 0 1 2 3 etc. i 0 1 2 3 4 5 T initial row and column

Best alignment of G to DP matrix j 0 1 2 3 etc. i 0 1 2 5 Which value are we interested in? 3 T 4 5 The value at (i,j) is the score of the best alignment of the first i characters of one sequence versus the first j characters of the other sequence.

DP matrix j 0 1 2 3 etc. i 0 1 2 3 4 T 5 The score of the best alignment of the two sequences. 5

Moving in the DP matrix 5 T

DP matrix 5 T

Best alignment of G and - DP matrix 5 T Moving horizontally in the matrix introduces a gap in the sequence along the left edge.

Best alignment of G and - DP matrix 5 1 T Moving horizontally in the matrix introduces a gap in the sequence along the left edge.

Best alignment of G and - T DP matrix 5 Moving vertically in the matrix introduces a gap in the sequence along the top edge. T 1

DP matrix 5 T

Best alignment of G and T DP matrix 5 Moving diagonally in the matrix aligns two residues T 0

So. an I now fill this matrix? T 7

So. an I now fill this matrix? T 7 3 3 17

Initialization T

Initialization 0 T

G - Introducing a gap 0-4 T

- Introducing a gap 0-4 -4 T

----- T omplete first row and column 0-4 -8-12 -16-20 -4-8 T -12-16 -20

What about i=1, j=1 i 0 1 2 3 4 5 j 0 1 2 3 etc. 0-4 -8-12 -16-20 -4? -8 T -12-16 -20

i 0 1 2 3 4 5 G- - Three ways to get to i=1, j=1 j 0 1 2 3 etc. 0-4 -8-12 -16-20 -4-8 -8 T -12-16 -20

i 0 1 2 3 4 5 -G - Three ways to get to i=1, j=1 j 0 1 2 3 etc. 0-4 -8-12 -16-20 -4-8 -8 T -12-16 -20

i 0 1 2 3 4 5 G 0-4 -8-12 -16-20 -4-5 -8 T -12-16 -20 Three ways to get to i=1, j=1 j 0 1 2 3 etc.

ccept the highest scoring of the three 0-4 -8-12 -16-20 -4-5 -8 T -12-16 Then simply repeat the same rule progressively across the matrix -20

DP matrix 0-4 -8-12 -16-20 -4-5 -8? T -12-16 -20

G- -G --G - -4+0=-4-5+-4=-9-8+-4=-12 DP matrix 0-4 -8-12 -16-20 -4-5 -4-8? T -12-16 -20 0-4

G- -5+-4=-9 -G -4+0=-4 --G - -8+-4=-12 DP matrix 0-4 -8-12 -16-20 -4-5 -4-8 -4 T -12-16 -20 0-4

DP matrix 0-4 -8-12 -16-20 -4-5 -8-4 T -12? -16? -20?

DP matrix 0-4 -8-12 -16-20 -4-5 -8-4 T -12-8 -16-12 -20-16

DP matrix 0-4 -8-12 -16-20 -4-5? -8-4? T -12-8? -16-12? -20-16?

Traceback 0-4 -8-12 -16-20 -4-5 -9-8 -4 5 What is the alignment associated with this entry? T -12-8 1-16 -12 2-20 -16-2

Traceback 0-4 -8-12 -16-20 -4-5 -9-8 -4 5 T -12-8 1-16 -12 2-20 -16-2 What is the alignment associated with this entry? Just follow the arrows back - this is called the traceback -G- T

Full lignment 0-4 -8-12 -16-20 -4-5 -9-8 -4 5 T -12-8 1-16 -12 2 ontinue and find the optimal global alignment, and its score. -20-16 -2?

Full lignment 0-4 -8-12 -16-20 -4-5 -9-13 -12-6 -8-4 5 1-3 -7 T -12-8 1 0 11 7-16 -12 2 11 7 6-20 -16-2 7 11 17

Full lignment 0-4 -8-12 -16-20 -4 Best -5 alignment -9 starts -13 at -12-6 -8 bottom -4 right 5 and follows 1-3 traceback arrows to top left -7 T -12-8 1 0 11 7-16 -12 2 11 7 6-20 -16-2 7 11 17

G-T T- One best traceback 0-4 -8-12 -16-20 -4-5 -9-13 -12-6 -8-4 5 1-3 -7 T -12-8 1 0 11 7-16 -12 2 11 7 6-20 -16-2 7 11 17

GT- -T nother best traceback 0-4 -8-12 -16-20 -4-5 -9-13 -12-6 -8-4 5 1-3 -7 T -12-8 1 0 11 7-16 -12 2 11 7 6-20 -16-2 7 11 17

GT- -T G-T T- 0-4 -8-12 -16-20 -4-5 -9-13 -12-6 -8-4 5 1-3 -7 T -12-8 1 0 11 7-16 -12 2 11 7 6-20 -16-2 7 11 17

Multiple solutions G-T T- GT- -T GT- -T When a program returns a single sequence alignment, it may not be the only best alignment but it is guaranteed to be one of them. In our example, all of the alignments at the left have equal scores. GT- -T

What s the complexity of this algorithm?

Practice problem: Find a best pairwise alignment of GT and TT 0 T T

DP in equation form lign sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty. d j i F d j i F y x s j i F j i F F j i 1, 1,, 1 1, max, 0 0,0

DP equation graphically F i 1, j 1 F i, j 1 s x, y i j d F i 1, j d F i, j take the max of these three