Sequence alignment. Genomes change over time

Shanna Hodges
5 years ago

1 Sequence alignment: Genomes change over time

2 What is sequence alignment?
Goal of alignment: Infer edit operations

3 Align biological sequences
Statement of the problem
Given:
- 2 sequences
- Scoring system for evaluating match (or mismatch) of two characters
- Penalty function for gaps in sequences
Produce: optimal pairing of sequences that
- Retains the order of the sequences
- Introduces gaps
- Maximizes total score

4 Enumeration of all possible alignments
Number of possible alignments of 2 sequences with length n and m
For 2 sequences of length n:
n      enumeration
10     184,756
20     1.38E+11
100    9.05E+58

Exact string matching
- Naïve
- Z box algorithm
- Boyer-Moore algorithm
- Knuth-Morris-Pratt algorithm
Dan Gusfield: Algorithms on Strings, Trees, and Sequences

5 Suffix (prefix) tries
Trie (from "retrieval") = ordered tree data structure
Sequence: ACACGT$
[Figure: suffix tree of ACACGT$, leaves labelled with suffix start positions]
Suffix tree: O(m) preprocessing time, O(n+k) search time (text length m, pattern length n, k occurrences)
Suffix array: built from the suffixes ACACGT$ CACGT$ ACGT$ CGT$ GT$ T$ $

Burrows-Wheeler transform (BWT)
- Append a character that is not part of the alphabet ($)
- Cyclic permutations: ACACGT$ CACGT$A ACGT$AC CGT$ACA GT$ACAC T$ACACG $ACACGT
- Sort lexicographically: $ACACGT ACACGT$ ACGT$AC CACGT$A CGT$ACA GT$ACAC T$ACACG
- Index of suffix array S(i) = (6,0,2,1,3,4,5)
- BWT B[i] = last character of each sorted rotation = T$CAACG
- Compression by move to front encoding on BWT, run length encoding, and variable length prefix code
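The construction on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: it sorts all rotations explicitly (fine for a toy string, far too slow for a genome), and the function name `bwt_with_suffix_array` is chosen here, not taken from the slides.

```python
def bwt_with_suffix_array(text, terminator="$"):
    """Return (suffix_array, bwt) for text, appending the terminator if missing."""
    if not text.endswith(terminator):
        text += terminator
    n = len(text)
    # Sort the start positions of all cyclic rotations lexicographically.
    suffix_array = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The BWT is the last column: the character just before each rotation start.
    bwt = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, bwt


sa, bwt = bwt_with_suffix_array("ACACGT")
print(sa)   # [6, 0, 2, 1, 3, 4, 5]
print(bwt)  # T$CAACG
```

Because "$" sorts before the letters A, C, G, T, the rotation starting at the terminator always comes first.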

6 Backward search and FM index
FM index (Full text index in Minute space)
- Based on Ferragina and Manzini, FOCS, 2000
- Suffix array can be derived from BWT in linear time
- Idea is to take full advantage of the structure of the suffix array for fast searching and of the BWT to reduce overall space occupancy
- Count the number of occurrences and find all positions of P in text T by looking at only a small portion of the compressed text
- Time: \( O(p) \), Space: \( O(n H_k(T)) + o(n) \)
- Used for sequence alignment in high throughput sequencing applications (e.g. Bowtie)

Backward search algorithm
Sorted rotations, First column to Last column (the Last column is the BWT): $ACACGT ACACGT$ ACGT$AC CACGT$A CGT$ACA GT$ACAC T$ACACG
[Tables: C(c) and Occ(c,q) (rank) over the alphabet $, A, C, G, T]

7 Backward search (exact string matching)
Sorted rotations (rows 1 to 7, First column to Last column): $ACACGT ACACGT$ ACGT$AC CACGT$A CGT$ACA GT$ACAC T$ACACG
Pattern CAC:
- i=3 (C): F=4, L=5
- i=2 (AC): F=2, L=3
- i=1 (CAC): F=4, L=4
Pattern ACG:
- i=3 (G): F=6, L=6
- i=2 (CG): F=5, L=5
- i=1 (ACG): F=3, L=3

Hash table based alignment strategy
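The F and L updates traced above can be reproduced with a small sketch. It assumes the usual definitions: C(c) is the number of text characters smaller than c, and Occ(c,q) is the number of occurrences of c in the first q characters of the BWT. The function name `fm_backward_search` and the uncompressed Occ table are illustrative choices, not how a production FM index stores its data.

```python
def fm_backward_search(text, pattern):
    """Return (F, L, positions) for pattern in text, or None if it does not occur.

    F and L are the 1-based first and last rows of the sorted rotations
    that start with the pattern; positions are 0-based text offsets.
    """
    n = len(text)
    suffix_array = sorted(range(n), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))

    # C[c]: number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)

    # occ[c][q]: number of occurrences of c in bwt[:q].
    occ = {c: [0] * (n + 1) for c in alphabet}
    for q, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][q + 1] = occ[c][q] + (1 if ch == c else 0)

    first, last = 0, n  # half-open row interval [first, last)
    for c in reversed(pattern):  # process the pattern from its last character
        if c not in C:
            return None
        first = C[c] + occ[c][first]
        last = C[c] + occ[c][last]
        if first >= last:
            return None
    return first + 1, last, sorted(suffix_array[first:last])


print(fm_backward_search("ACACGT$", "CAC"))  # (4, 4, [1])
print(fm_backward_search("ACACGT$", "ACG"))  # (3, 3, [2])
```

Each pattern character costs two table lookups, which is where the O(p) search time comes from.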

8 How to map billions of short reads onto genomes
Maq, Tophat
Trapnell C, Salzberg S. Nature Biotech. 2009

Dot matrix
- Put one sequence along the top row of a matrix.
- Put the other sequence along the left column of the matrix.
- Plot a dot every time there is a match between an element of the row sequence and an element of the column sequence.
- Diagonal lines indicate areas of match.
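A dot matrix is easy to sketch as a nested list of booleans. The helper names `dot_matrix` and `render` and the example sequences are illustrative assumptions.

```python
def dot_matrix(row_seq, col_seq):
    """One row per character of col_seq; True where the two characters match."""
    return [[row_char == col_char for row_char in row_seq] for col_char in col_seq]


def render(row_seq, col_seq):
    """Text rendering: '*' for a dot, '.' for no match."""
    lines = ["  " + " ".join(row_seq)]
    for col_char, row in zip(col_seq, dot_matrix(row_seq, col_seq)):
        lines.append(col_char + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("ACACGT", "CACG"))
```

In the printed grid, the shared substring CACG shows up as a run of stars along one diagonal.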

9 Dot matrix
Problems with dot matrices
- Rely on visual analysis
- Difficult to find optimal alignments
- Need scoring schemes more sophisticated than identical match
- Difficult to estimate significance of alignments

10 Biology of gaps

Gap penalties
We expect to penalize gaps. The standard cost associated with a gap of length g:
- Linear gap penalty function: \( \gamma(g) = -g\,d \)
- Convex gap penalty function (more realistic)
- Affine score: \( \gamma(g) = -d - (g-1)\,e \)
d: gap open penalty, e: gap extend penalty
[Figure: \( \gamma(g) \) plotted against gap length g]
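The two closed-form penalties can be written down directly. The default values for d and e below are arbitrary illustrative numbers, not values from the slides.

```python
def linear_gap(g, d=8):
    """Linear gap score: gamma(g) = -g * d."""
    return -g * d


def affine_gap(g, d=12, e=2):
    """Affine gap score: gamma(g) = -d - (g - 1) * e, with gamma(0) = 0."""
    if g == 0:
        return 0
    return -d - (g - 1) * e


for g in (1, 3, 10):
    print(g, linear_gap(g), affine_gap(g))
```

With these numbers a single long gap is much cheaper under the affine scheme than under the linear one, which matches the intuition that one insertion or deletion event can span several residues.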

11 Distance scoring matrices
Distances can be calculated between sequences: the higher the distance, the smaller the similarity.
Distances fulfill the properties of a metric:
- \( d(s,t) \ge 0 \)
- \( d(s,t) \le d(s,u) + d(u,t) \)
- \( d(s,t) = d(t,s) \)
- \( d(t,s) = 0 \Leftrightarrow t = s \)

Hamming distance: number of letters in which the sequences differ (not valid if the sequences have different length)
s        AAT   AGCAA   AGCACACA
t        TAA   ACATA   ACACACTA
HD(s,t)  2     3       6

Levenshtein distance: w(a,a)=0, w(a,b)=1 for a ≠ b, w(-,a)=w(b,-)=1 (deletion, insertion)
s       AGCACAC-A
t       A-CACACTA
d(s,t)  2

For two sequences, the distance is unique, but the optimal alignment (the one with minimal cost or distance) is not unique.
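Both distances are short to compute. The sketch below uses the unit costs stated on the slide (substitution, insertion and deletion each cost 1); the function names are chosen here for illustration.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Edit distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j - 1] + (a != b),  # match or substitution
                            prev[j] + 1,             # deletion from s
                            curr[j - 1] + 1))        # insertion into s
        prev = curr
    return prev[-1]


print(hamming("AGCACACA", "ACACACTA"))      # 6
print(levenshtein("AGCACACA", "ACACACTA"))  # 2
```

The last example shows why the Hamming distance can be misleading: one deletion and one insertion explain what looks like six mismatches.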

12 Substitution matrices
The unrelated or random model assumes that letter a occurs independently with some frequency \( q_a \):
\[ P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j} \]
In the alternative match model, aligned pairs of residues occur with a joint probability \( p_{ab} \):
\[ P(x,y \mid M) = \prod_i p_{x_i y_i} \]
Odds ratio:
\[ \frac{P(x,y \mid M)}{P(x,y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_j q_{y_j}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}} \]
Log odds ratio (score matrix or substitution matrix):
\[ S = \sum_i s(x_i, y_i), \qquad s(a,b) = \log \frac{p_{ab}}{q_a q_b} \ \text{for aligned pair } (a,b) \]
- s > 0: more likely than random; s < 0: less likely than random
- Physical properties of amino acids (e.g. hydrophobic vs. hydrophilic) are the reason that there are differences in the substitution scores
- Manually align protein structures (or, more risky, sequences)
- Look for frequency of amino acid substitutions at structurally nearly constant sites

13 PAM matrices
Margaret Dayhoff, 1978
Point Accepted Mutation (PAM)
- Look at patterns of substitutions in related proteins
- The new side chain must function the same way as the old one ("acceptance")
- On average, 1 PAM corresponds to 1 amino acid change per 100 residues
- 1 PAM ~ 1% divergence
- Extrapolate to predict patterns at longer distances

PAM matrices
Assumptions
- Replacement is independent of surrounding residues
- Sequences being compared are of average composition
- All sites are equally mutable
Sources of error
- Small, globular proteins used to derive matrices (departure from average composition)
- Errors in PAM 1 are magnified up to PAM 250
- Does not account for conserved blocks or motifs

14 BLOSUM matrices
Henikoff and Henikoff, 1992
Blocks Substitution Matrix (BLOSUM)
- Look only for differences in conserved, ungapped regions of a protein family
- More sensitive to structural or functional substitutions
BLOSUM n
- Contribution of sequences > n% identical weighted to 1
- Substitution frequencies are more heavily influenced by sequences that are more divergent than this cutoff
- Clustering reduces contribution of closely related sequences
- Reducing n yields more distantly related sequences
[Figure: BLOSUM62 matrix]

15 Summary of substitution matrices
Triple PAM strategy (Altschul, 1991)
- PAM 40: short alignments, highly similar
- PAM 120
- PAM 250: longer, weaker local alignments
BLOSUM (Henikoff, 1993)
- BLOSUM 90: short alignments, highly similar
- BLOSUM 62: most effective in detecting known members of a protein family
- BLOSUM 30: longer, weaker local alignments
No single matrix is the complete answer for all sequence comparisons.
Programs like BLAST usually have default matrices!

Dynamic programming: Fibonacci numbers
    def fib(n):
        # Table of already computed values; fib(0) = fib(1) = 1.
        fib_table = [1, 1] + [0] * max(0, n - 1)
        for i in range(2, n + 1):
            fib_table[i] = fib_table[i - 1] + fib_table[i - 2]
        return fib_table[n]
Runs in linear time O(n). The table takes O(n) space; keeping only the last two values reduces this to constant space O(1).

16 Dynamic programming for sequence alignment
Sequence alignment
- Global alignment: Needleman-Wunsch algorithm
- Local alignment: Smith-Waterman algorithm

17 Global alignment: Needleman-Wunsch algorithm
Construct a matrix F(i,j) where i is the index from sequence 1 and j is the index from sequence 2.
Starting with F(0,0) = 0:
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases} \]
where s is the substitution matrix and d the gap penalty.
[Figure: cell F(i,j) filled from F(i-1,j-1) with s(x_i,y_j), from F(i-1,j) with -d, and from F(i,j-1) with -d]

Global sequence alignment
Example with S=BLOSUM50 and d=8, sequences HEAGAWGHEE (top row) and PAWHEAE (left column)
[Figure: filled F matrix from start (top left) to best score (bottom right)]
HEAGAWGHE-E
--P-AW-HEAE
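The recursion and the traceback can be sketched as follows. To stay self-contained, the example uses a toy scoring function (+1 match, -1 mismatch) instead of BLOSUM50, so its scores will not match the slide's example; `simple_score` and `needleman_wunsch` are illustrative names.

```python
def simple_score(a, b):
    """Toy substitution score (an assumption, not BLOSUM50)."""
    return 1 if a == b else -1


def needleman_wunsch(x, y, score=simple_score, d=1):
    """Return (best score, aligned x, aligned y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d  # x prefix aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = -j * d  # y prefix aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

When several choices tie, this traceback prefers the diagonal move; other optimal alignments with the same score may exist.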

18 Backtracing to get optimal alignment
- Keep a pointer to the choice made at each step
- Remember all pointers and trace back
- Time needed: O(m*n)
- Space needed: O(m*n)
How can this be improved?

Linear space alignment
- To calculate the scores for column j, only column j-1 is needed
[Figure: F matrix for HEAGAWGHEE (top row) and PAWHEAE (left column)]

19 Local alignment: Smith-Waterman algorithm
- Look for best alignments between subsequences
- E.g. two proteins sharing a common domain
- Algorithm is similar to global alignment
\[ F(0,j) = F(i,0) = 0 \]
\[ F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases} \]

Local alignment
[Figure: F matrix for HEAGAWGHEE (top row) and PAWHEAE (left column); the traceback runs from the best score back to the first cell with value 0 (stop)]
AWGHE
AW-HE
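A matching sketch for the local case differs in three places: the borders stay at 0, every cell is floored at 0, and the traceback starts at the best cell and stops at the first 0. As before, the toy scoring function (+1 match, -1 mismatch) is an assumption used in place of a real substitution matrix.

```python
def simple_score(a, b):
    """Toy substitution score (an assumption, not a real substitution matrix)."""
    return 1 if a == b else -1


def smith_waterman(x, y, score=simple_score, d=1):
    """Return (best score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # borders F(i,0) = F(0,j) = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (3, 'ACG', 'ACG')
```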

20 Issues for local alignment
- Time needed: O(m*n)
- Space needed: O(m*n), can be brought to O(m+n)
- Local similarities may occur in sequences with different structure or function that share a common substructure/subfunction (domains, motifs)

Database search for sequences

21 Database search
How to answer the query
- We could just scan the whole database
- But: the query must be very fast
- Most sequences will be completely unrelated to the query
- An individual alignment need not be perfect; it can be fine-tuned later
Exploit the nature of the problem
- If you're going to reject any match with idperc < 90%, then why bother even looking at sequences which don't have a fairly long stretch of matching a.a. in a row?

22 W-mer indexing
- Preprocessing: for every W-mer (e.g., W=3), store every location in the database where it occurs (can use hashing if W is large)
- Query: generate W-mers and look them up in the database; process the results
- Running time benefit: for W=3, if the sequences are random, then roughly one W-mer in 23^3 will match, i.e., about one in ten thousand
- We hit only a small fraction of all sequences

FASTA
- Use a hash table of short words (2-6 chars) of the database (DB) sequence and the query sequence
- For words in the query sequence, find similar words in the DB using (fast) hash table lookup, and compute R = position(query) - position(DB)
- Areas of long match will show the same R for many words
- Score matching segments based on the content of these matches
- Extend the good matches empirically
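The preprocessing and query steps can be sketched with a dictionary as the hash table. The function names, the two-sequence toy database and the exact-match lookup are illustrative assumptions; FASTA itself also scores and extends the hits.

```python
from collections import Counter, defaultdict


def build_wmer_index(database, w=3):
    """Map every W-mer to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def diagonal_hits(query, index, w=3):
    """Count exact W-mer hits per (sequence id, R), R = position(query) - position(DB)."""
    counts = Counter()
    for qpos in range(len(query) - w + 1):
        for seq_id, dbpos in index.get(query[qpos:qpos + w], []):
            counts[(seq_id, qpos - dbpos)] += 1
    return counts


database = ["HEAGAWGHEE", "MKVLAAGIV"]  # toy database, made up for this example
index = build_wmer_index(database)
print(diagonal_hits("PAWHEAE", index))
```

Only sequences that share at least one W-mer with the query are ever touched, and many hits with the same R point to a long matching region on one diagonal.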

23 BLAST
- Finds inexact, ungapped seeds using a hashing technique (like FASTA) and then extends the seed to the maximum length possible
- Based on a strong statistical/significance framework: what is a significantly high score of two segments of length N and M?
- Most commonly used for fast searches and alignments
- New versions now do gapped segments

High scoring segment pairs

24 High scoring segment pairs
- Receive query
- Split query into overlapping words of length W
- Find neighborhood words for each word until threshold T
- Look into the table where these neighbor words occur: seeds
- Extend seeds until score drops off under X
- Evaluate statistical significance of score
- Report scores and alignments

Significance of scores
The number of unrelated matches with score greater than S is approximately Poisson distributed with mean
\[ E(S) = K\,m\,n\,e^{-\lambda S} \]
where \( \lambda \) is a scaling factor and m and n are the lengths of the sequences.
The probability that there is a match of score greater than S follows an extreme value distribution:
\[ P(x > S) = 1 - e^{-E(S)} \]
Karlin S, Altschul S. Proc Natl Acad Sci (1990)
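The two formulas translate directly into code. The values of K, lambda, m and n in the example are placeholders picked for illustration; real values depend on the scoring matrix, gap costs and database.

```python
import math


def expected_hits(S, m, n, K, lam):
    """E(S) = K * m * n * exp(-lam * S): expected number of chance hits scoring above S."""
    return K * m * n * math.exp(-lam * S)


def p_value(S, m, n, K, lam):
    """P(x > S) = 1 - exp(-E(S)), computed with expm1 for accuracy when E is tiny."""
    return -math.expm1(-expected_hits(S, m, n, K, lam))


# Placeholder parameters, for illustration only.
K, lam, m, n = 0.1, 0.3, 300, 1_000_000
for S in (40, 60, 80):
    print(S, expected_hits(S, m, n, K, lam), p_value(S, m, n, K, lam))
```

For small E the P-value is almost equal to E itself, which is why the E-value alone is usually reported.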

25 NCBI Blast
Program   Query sequence                     Subject sequence
BLASTN    Nucleotide                         Nucleotide
BLASTP    Protein                            Protein
BLASTX    Nucleotide six frame translation   Protein
TBLASTN   Protein                            Nucleotide six frame translation
TBLASTX   Nucleotide six frame translation   Nucleotide six frame translation

NCBI Blast Example

26 Blast Results
[Figure: BLAST result page showing conserved domain database (CDD) hits, graphical visualization, best hit description, E value, score (S), and alignment]

MegaBlast
MegaBLAST uses a greedy algorithm for the nucleotide sequence alignment search. This program is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When a larger word size is used, it is up to 10 times faster than more common sequence similarity programs. MegaBLAST is also able to efficiently handle much longer DNA sequences than the blastn program of the traditional BLAST algorithm.

27 BLAT
- BLAST-like alignment tool
- UCSC genome browser ( bin/hgblat)
- Designed to rapidly align longer nucleotide sequences (L 40) having >95% sequence similarity
- 500 times faster than BLAST for mRNA/cDNA searches
- On DNA, BLAT works by keeping an index of an entire genome (k-mers) in memory. Thus, the target database of BLAT is not a set of GenBank sequences, but instead an index derived from the assembly of the entire genome.
- It may miss more divergent or short sequence alignments
- Can also be used for protein sequences

Multiple sequence alignment
Often a simple extension of pairwise alignment:
Given:
- Set of sequences
- Match matrix
- Gap penalties
Find: alignment of sequences such that the optimal score is achieved.

28 Goals of multiple sequence alignment
Determine Consensus Sequences
- Prosite, emotif
- ClustalW, MACAW, Pileup, T-Coffee
Building Gene Families
- Blocks, Prints, ProDom, pfam, DOMO, eblocks
Develop Relationships & Phylogenies
- Clusters, Relationships, Evolutionary Models
- Phylip, GrowTree, MACAW, PAUP
Model Protein Structures for Threading and Fold Prediction
- Profiles, Templates, HSSP, FSSP
- Hidden Markov Models, pfam, SAM
- Network Models, Neural Nets, Belief Nets
- Statistical Models, Generalized Linear Models

Exhaustive search using Dynamic Programming
- Why not just use the same technique as for pairwise alignment?
- Instead of a 2 dimensional SCORE matrix, use an N dimensional one.
- Fill from one corner to the diagonally opposite corner in N dimensions.
- Complexity increases exponentially with the number of sequences, \( O(M^N) \), so only N < 10 and lengths (M) ~ 200 can be accommodated.

29 Dynamic Programming

30 MSA Algorithm
Based on the dynamic programming concept:
1. Compute optimal pairwise alignments to get an upper bound on any pair of alignments. (The multiple alignment (MA) can't do any better than the sum of optimal pairwise alignments.)
2. Create a heuristic multiple alignment in ad hoc fashion to create a lower bound on the MA score (e.g. align all sequences to the first).
3. Search the N dimensional scoring matrix (as in the pairwise case) for the optimal path, where S[i,j,k,...] is the best score including the ith element of sequence 1, jth of sequence 2, kth of sequence 3, etc.

Greedy algorithm
1. Select the most similar pair of sequences
2. Join these sequences to build a profile. This reduces the number of sequences/profiles from k to k-1
3. Repeat until only one profile is left
This heuristic approach is called a greedy algorithm.
Example:
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC

31 Greedy algorithm
Pairwise alignments, best score first:
s2 GTCTGA
s4 GTCAGC     (score = 2)

s1 GAT-TCA
s2 G-TCTGA    (score = 1)

s1 GAT-TCA
s3 GATAT-T    (score = 1)

s1 GATTCA--
s4 G-T-CAGC   (score = 0)

s2 G-TCTGA
s3 GATAT-T    (score = -1)

s3 GAT-ATT
s4 G-TCAGC    (score = -1)

Join the most similar pair (s2, s4) into a profile:
s2    GTCTGA
s4    GTCAGC
s2,4  GTCt/aGa/c
Remaining sequences and profiles:
s1    GATTCA
s3    GATATT
s2,4  GTCt/aGa/c

Progressive Alignment: Tree method
1. Perform hierarchical clustering (similar to arrays)
2. Merge sequences to find ancestor sequences by finding sequences with minimum edit distance to the two children sequences (see next slides).
3. Assign weights to each branch of the tree, based on distance between sequences (see next slides)
4. Align sequences (starting from the closest, using a version of dynamic programming) using weights in the score function (see next slides)

32 ClustalW

33 Markov chains
Markov chain: a sequence of events that occur one after another. The main restriction on a Markov chain is that the probability assigned to an event at any location in the chain can depend on only a fixed number of previous events.

Scoring sequences (e.g. start codon ATG)
3 states (S1, S2, S3); background p(A)=p(C)=p(G)=p(T)=0.25
State   Consensus   p(A)   p(C)   p(G)   p(T)
S1      A           0.91   0.03   0.03   0.03
S2      T           0.03   0.03   0.03   0.91
S3      G           0.03   0.03   0.91   0.03
Markov chain of 0th order: p(ATG) = 0.91*0.91*0.91 = 0.754
Markov chain of 1st order: p(ATG) = p(A)*p(T|A)*p(G|T)

Markov chain
What is the probability that we are looking at a start codon, given that the sequence is CTG?
P(M|CTG) = P(CTG|M) * P(M) / P(CTG) (from Bayes' theorem)
P(CTG) = 0.25*0.25*0.25 = 0.0156
P(CTG|M) = 0.03*0.91*0.91 = 0.0248
P(M) not relevant
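The start codon example can be checked numerically. The sketch below encodes the three position-specific states and the uniform background from the slide; the function names are illustrative, and the ratio P(seq|M)/P(seq) is the part of Bayes' theorem that does not depend on the prior P(M).

```python
BASES = "ACGT"
BACKGROUND = {b: 0.25 for b in BASES}

# One emission table per state S1, S2, S3 (consensus A, T, G).
START_CODON_MODEL = [
    {b: (0.91 if b == consensus else 0.03) for b in BASES}
    for consensus in "ATG"
]


def prob_model(seq):
    """P(seq | M) under the 0th order, position-specific model."""
    p = 1.0
    for state, base in zip(START_CODON_MODEL, seq):
        p *= state[base]
    return p


def prob_background(seq):
    """P(seq) under the uniform background."""
    p = 1.0
    for base in seq:
        p *= BACKGROUND[base]
    return p


def likelihood_ratio(seq):
    """P(seq | M) / P(seq); multiply by the prior P(M) to get P(M | seq)."""
    return prob_model(seq) / prob_background(seq)


for codon in ("ATG", "CTG", "CCC"):
    print(codon, prob_model(codon), likelihood_ratio(codon))
```

A ratio above 1 means the sequence is better explained by the start codon model than by the background, which is the case for CTG as well as for ATG.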

34 Hidden Markov Model (HMM)
Example: exon-intron border
- 3 states: exon (E), 5' SS (5), intron (I)
- Emission probabilities
- HMM parameters (Θ)
- Given: sequence (S)
- Hidden, want to infer: state path (π) (hidden Markov chain)
- Find the best state path (highest score)
- If there are many possible paths, use the efficient Viterbi algorithm (based on dynamic programming)
[Figure: example sequence (fragment shown: G T A A G T C A) with its state path]
log P(S,π|HMM,Θ) = log(1* * *0.1*0.95*1.0*0.4*0.9*0.4*0.9*0.4*0.9*0.1*0.9*0.4*0.9*0.1*0.9*0.4*0.1)
Eddy SR, Nat Biotech 2004

Profile Hidden Markov Model
For multiple alignments (e.g. DNA sequences):
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Regular expression: [AT][CG][AC][ACGT]*A[TG][GC]
Emission probabilities:
State             P(A)   P(C)   P(G)   P(T)
Match 1           0.8    0.0    0.0    0.2
Match 2           0.0    0.8    0.2    0.0
Match 3           0.8    0.2    0.0    0.0
Insertion state   0.2    0.4    0.2    0.2
Match 4           1.0    0.0    0.0    0.0
Match 5           0.0    0.0    0.2    0.8
Match 6           0.0    0.8    0.2    0.0
p(ACACATC) = 0.8*1*0.8*1*0.8*0.6*0.4*0.6*1*1*0.8*1*0.8 = 0.047
log odds = log(p(S)/0.25^L) = log(0.047/0.25^7)

35 Profile Hidden Markov Model
- Allows position dependent gap penalties
- Can be obtained from a multiple alignment (DNA or protein)
- Can be used for searching a database for other members of the family
- Main states (gray)
- Insert states to model highly variable regions in the alignment
- Delete (silent, null) states
- Avoid overfitting by using pseudocounts (e.g. add 1 to all counts)
