Sequence Alignment


Hannah Chambers
5 years ago

1 Sequence Alignment

2 Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

3 Distance between sequences Mutation events. Mutation: we need a score for the substitution of the symbol i with the symbol j (amino acid residues, nucleotides, etc.), given by a substitution matrix s(i,j). A: ALASVLIRLITRLYP B: ASAVHLNRLITRLYP The substitution matrix should account for the underlying biological process (conserved structures or functions).

4 Substitution matrices The basic idea is to measure the correlation between 2 sequences. Given a pair of correlated sequences, we measure the frequency $p_{ab}$ of the substitution a -> b (assuming symmetry). Under the null hypothesis (random correlation, i.e. independent events) the expected frequency is the product $p_a p_b$. We then define $s(a,b) = \log\left(\frac{p_{ab}}{p_a p_b}\right)$ (the log is additive, so a product of probabilities over aligned pairs becomes a sum of scores). Minimal distance = maximum score (s).

5 Substitution matrices Following this idea several matrices have been derived. Their main difference lies in the type of alignments used for computing the frequencies. PAMx (Point Accepted Mutation): x is the number (%) of mutation events. The matrix is built as $A^{1}_{ik} = P(k \mid i)$ for sequences with 1% mutations, and $A^{n}_{ik} = \left[(A^{1})^{n}\right]_{ik}$ for n% mutation events (this counts mutation events, NOT mutated residues). E.g. for n = 2: $P^{(2)}(k \mid i) = \sum_a P(a \mid i)\,P(k \mid a)$. Finally, $\mathrm{PAM}_n(i,j) = \log\left(A^{n}_{ij} / P_j\right)$.

6 PAM1 Very stringent matrix: no off-diagonal element is > 0

7 PAM160 Some positive values appear off the diagonal

8 PAM250 This is one of the most widely used matrices

9 PAM500

10 Substitution matrices PAM matrices are computed starting from very similar sequences, and the scoring matrices for more distant sequences are derived by matrix products. BLOSUMx: this family is computed directly from blocks of alignments with a defined sequence identity threshold (> x%). => For closely related sequences we must use PAM with low numbers and BLOSUM with large numbers. The opposite holds for distant sequences.

11 BLOSUM62 Probably the most widely used

12 BLOSUM90

13 BLOSUM30

14 Exercise: Write your own substitution matrix. Given a set of aligned sequences, compute Pab = frequency of mutations between a->b (assume symmetry: a->b counts also as b->a). Compute Pa as the marginal probability of Pab. Finally, derive the substitution matrix: $s(a,b) = \log\left(\frac{p_{ab}}{p_a p_b}\right)$. Use the files provided at: Hints: 1. Start with toy.aln to check your algorithm. 2. Initialize your P[a][b] matrix with pseudocounts (such as 1 instead of 0).
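One possible starting point for this exercise is sketched below. It is only an illustration under stated assumptions: the aligned sequences are passed as a list of equal-length strings (no file parsing), every pair of sequences is compared column by column, gap columns are skipped, and the names `substitution_matrix` and `toy` are hypothetical.

```python
import math
from itertools import combinations

def substitution_matrix(aligned, pseudocount=1.0):
    """Return s[a][b] = log(p_ab / (p_a * p_b)) from equal-length aligned sequences."""
    alphabet = sorted({c for seq in aligned for c in seq if c != "-"})
    # Pair counts start from the pseudocount instead of 0.
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for s1, s2 in combinations(aligned, 2):
        for a, b in zip(s1, s2):
            if a == "-" or b == "-":
                continue  # gap columns are skipped (an assumption of this sketch)
            counts[a][b] += 1
            counts[b][a] += 1  # symmetry: a->b also counts as b->a
    total = sum(counts[a][b] for a in alphabet for b in alphabet)
    p_ab = {a: {b: counts[a][b] / total for b in alphabet} for a in alphabet}
    p_a = {a: sum(p_ab[a].values()) for a in alphabet}  # marginals of p_ab
    return {a: {b: math.log(p_ab[a][b] / (p_a[a] * p_a[b])) for b in alphabet}
            for a in alphabet}

toy = ["AAB", "AAB", "ABB"]  # hypothetical stand-in for the contents of toy.aln
s = substitution_matrix(toy)
```

On this toy input the conserved pairs get positive scores and the A/B substitution a negative one, which is the behaviour expected from a log-odds matrix.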

15 DotPlots:

16 DotPlot Exercise: Write a program dotplot.py that takes as input a FASTA file with two sequences, a scoring matrix, a sliding window size and a threshold cut-off, such as: dotplot.py fasta.in score.mat 11 threshold and prints the dotplot on the standard output. PS: you can use matplotlib for a nicer presentation.
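The core of such a program might look like the sketch below. It assumes the two sequences are already loaded as strings and that the scoring matrix is available as a function `score(a, b)`; the identity scoring used here and the function name `dotplot` are placeholders, and file parsing and plotting are left out.

```python
def dotplot(a, b, score, window, threshold):
    """Return one text row per residue of a; '*' marks windows scoring >= threshold."""
    half = window // 2
    rows = []
    for i in range(len(a)):
        row = []
        for j in range(len(b)):
            total = 0
            # Sum the scores along the diagonal inside the window centred on (i, j).
            for k in range(-half, half + 1):
                if 0 <= i + k < len(a) and 0 <= j + k < len(b):
                    total += score(a[i + k], b[j + k])
            row.append("*" if total >= threshold else ".")
        rows.append("".join(row))
    return rows

def identity(a, b):
    # Placeholder scoring: 1 for identical residues, 0 otherwise.
    return 1 if a == b else 0

for line in dotplot("ACGT", "ACGT", identity, window=3, threshold=2):
    print(line)
```

Windows that run past the sequence ends are simply truncated here; a real implementation might prefer to skip those positions.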

17 Distance between sequences What kind of operations do we consider? Mutation, deletion and insertion (rarely rearrangements). A: ALASVLIRLIT--YP B: ASAVHL---ITRLYP The negative gap score depends only on the number of gap positions (n): gap score = -nd (linear), or gap score = -d - (n-1)e (affine; d: gap open, e: gap extension). Please note that all scores are position-independent along the sequence.

18 Pairwise alignment Given 2 sequences, what is an alignment with maximum score? Naive solution: try every possible alignment and select the one with the best score. Using the score function :

19 How many alignments are there? Case without internal gaps (figure: the ungapped placements of BB relative to AAA, from BB entirely to one side of AAA to BB entirely to the other side). Given 2 sequences of lengths m and n, the number of shifts is m + n + 1.

20 How many alignments are there? Case with internal gaps (figure: gapped alignments of AAA and BB, each corresponding to an interleaving such as BBAAA, BABAA, BAABA, BAAAB, ABBAA, ABABA, ABAAB, AABBA, AABAB, AAABB). The number of alignments is equal to the number of all possible ways of mixing the 2 sequences while keeping the original order of each sequence. Given 2 sequences of lengths n and m, there are $\sum_{k=0}^{\min(m,n)} \frac{(m+n-k)!}{k!\,(n-k)!\,(m-k)!}$ of them. For n=m=8 we get > 143 possible alignments!
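The sum above is easy to evaluate directly, which gives a feeling for how fast the count grows. The snippet below is a small sketch; the function name `count_alignments` is just an illustrative choice.

```python
from math import factorial

def count_alignments(m, n):
    """Evaluate sum_{k=0}^{min(m,n)} (m+n-k)! / (k! (n-k)! (m-k)!)."""
    return sum(
        factorial(m + n - k) // (factorial(k) * factorial(n - k) * factorial(m - k))
        for k in range(min(m, n) + 1)
    )

for size in (2, 4, 8, 16):
    print(size, count_alignments(size, size))
```

For the AAA / BB example, `count_alignments(3, 2)` evaluates to 25 (10 + 12 + 3 for k = 0, 1, 2).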

21 Basic idea We can keep a table of the precomputed substring alignments (dynamic programming). Sequences: ALSKLASPALSAKDLDSPALS and ALSKIADSLAPIKDLSPASLT (figure: alternative gapped alignments of the two sequences, e.g. ALSKLASPALSAKDLDSPAL-S)

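22 Basic idea Building step by step. Given the 2 sequences ALSKLASPALSAKDLDSPALS and ALSKIADSLAPIKDLSPASLT, the best alignment between the two substrings ALSKLASPA and ALSKIAD can be computed by considering only: (1) best alignment of ALSKLASP with ALSKIA, plus A aligned to D; (2) best alignment of ALSKLASP with ALSKIAD, plus A aligned to a gap (-); (3) best alignment of ALSKLASPA with ALSKIA, plus a gap (-) aligned to D. The result is the best among these 3 possibilities.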

23 The Needleman-Wunsch Matrix (figure: matrix with x1 … xM along one axis and y1 … yN along the other) Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences

24 The Needleman-Wunsch Algorithm Example: x = AGTA, y = ATA, with m = 1 (match), s = -1 (mismatch), d = -1 (gap). (table: F(i,j), rows indexed by i over x and columns by j over y) Optimal alignment: F(4,3) = 2, AGTA aligned with A-TA

25 The Needleman-Wunsch Algorithm Initialization: $F(0,0) = 0$, $F(0,j) = -j \times d$, $F(i,0) = -i \times d$. Main iteration (filling in partial alignments): for each i = 1 … M, for each j = 1 … N, $F(i,j) = \max\{\,F(i-1,j) - d \;[\text{case 1}],\; F(i,j-1) - d \;[\text{case 2}],\; F(i-1,j-1) + s(x_i,y_j) \;[\text{case 3}]\,\}$ and Ptr(i,j) = UP if [case 1], LEFT if [case 2], DIAG if [case 3]. Termination: F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.
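The initialization, iteration and traceback described on this slide can be sketched as follows. This is an illustration, not a reference implementation: `d` is taken as a positive gap penalty that is subtracted, ties between cases are broken arbitrarily (DIAG first), and `match_score` is a hypothetical stand-in for a substitution matrix.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cases = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "DIAG"),
                (F[i - 1][j] - d, "UP"),
                (F[i][j - 1] - d, "LEFT"),
            ]
            F[i][j], ptr[i][j] = max(cases, key=lambda c: c[0])
    # Trace back from (M, N) to (0, 0) following the stored pointers.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

def match_score(a, b):
    # Illustrative scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

print(needleman_wunsch("AGTA", "ATA", match_score, 1))
```

With match +1, mismatch -1 and gap penalty 1, the AGTA / ATA example gives score 2 and the alignment AGTA over A-TA.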

26 Exercise: Suppose you only want to know the score of a global alignment. => Write a program that, given two input sequences (in a single file in FASTA format), a gap cost and a similarity matrix, computes the score of the global alignment in O(N*M) time and in O(M) space, where M and N are the lengths of the input sequences and M<=N
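One way to reach the O(M) space bound is to keep only the previous and the current row of the matrix, indexed along the shorter sequence. The sketch below assumes a symmetric scoring function (so the two sequences can be swapped) and leaves out the FASTA and matrix parsing; `global_score` and `match_score` are illustrative names.

```python
def global_score(x, y, s, d):
    """Score of the global alignment, O(len(x) * len(y)) time, O(min length) space."""
    if len(x) < len(y):
        x, y = y, x  # keep the shorter sequence along the row (assumes s is symmetric)
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(
                prev[j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned to y_j
                prev[j] - d,                          # x_i aligned to a gap
                curr[j - 1] - d,                      # y_j aligned to a gap
            )
        prev = curr
    return prev[len(y)]

def match_score(a, b):
    # Illustrative scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

print(global_score("AGTA", "ATA", match_score, 1))
```

Only the score is recovered this way; the pointers needed for a traceback are discarded together with the old rows.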

27 The Overlap Detection variant (figure: matrix with x1 … xM and y1 … yN along the axes) Changes: Initialization: for all i, j, $F(i,0) = F(0,j) = 0$. Termination: $F_{OPT} = \max\{\,\max_i F(i,N),\; \max_j F(M,j)\,\}$

28 The local alignment problem Given two strings x = x1 … xM, y = y1 … yN, find substrings x', y' whose similarity (optimal global alignment value) is maximum, e.g. x = aaaacccccgggg, y = cccgggaaccaacc

29 Why local alignment Genes are shuffled between genomes Portions of proteins (domains) are often conserved

30 The Smith-Waterman algorithm Idea: ignore badly aligning regions. Modifications to Needleman-Wunsch: Initialization: $F(0,j) = F(i,0) = 0$. Iteration: $F(i,j) = \max\{\,0,\; F(i-1,j) - d,\; F(i,j-1) - d,\; F(i-1,j-1) + s(x_i,y_j)\,\}$

31 The Smith-Waterman algorithm Termination: if we want the best local alignment, $F_{OPT} = \max_{i,j} F(i,j)$. If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back.
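A compact sketch of the best-local-alignment case is given below. It assumes a positive linear gap penalty `d`, recomputes the traceback direction from the scores instead of storing pointers, and stops the traceback at the first cell with score 0; the names `smith_waterman` and `match_score` are illustrative.

```python
def smith_waterman(x, y, s, d):
    """Best local alignment with linear gap penalty d; returns (score, sub_x, sub_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]  # first row and column stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

def match_score(a, b):
    # Illustrative scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

print(smith_waterman("aaaacccccgggg", "cccgggaaccaacc", match_score, 1))
```

On the two example strings from the local alignment slide, this scoring picks out the shared segment cccggg with score 6.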

32 Scoring the gaps more accurately Current model: $\gamma(n)$, a gap of length n incurs penalty $n \times d$. However, gaps usually occur in bunches. Concave gap penalty function $\gamma(n)$: for all n, $\gamma(n+1) - \gamma(n) \le \gamma(n) - \gamma(n-1)$

33 General gap dynamic programming Initialization: same. Iteration: $F(i,j) = \max\{\,F(i-1,j-1) + s(x_i,y_j),\; \max_{k=0 \dots i-1} F(k,j) - \gamma(i-k),\; \max_{k=0 \dots j-1} F(i,k) - \gamma(j-k)\,\}$. Termination: same. Running time: $O(N^2 M)$. Space: $O(NM)$ (assume N > M)

34 Exercise: Write a program that, given two input sequences (in a single file in FASTA format) and a choice of a general gap function and scoring matrix, computes the alignment of the two sequences and returns one of the possible best alignments. Remember: when the best score is obtained from $\max_{k=0 \dots i-1} F(k,j) - \gamma(i-k)$ or from $\max_{k=0 \dots j-1} F(i,k) - \gamma(j-k)$, you have to store this information (including the maximizing k) in the corresponding pointer (back-trace) matrix.

35 Compromise: affine gaps $\gamma(n) = d + (n-1) \times e$ (d: gap open, e: gap extend). To compute the optimal alignment, at position i,j we need to remember the best score if a gap is open and the best score if a gap is not open. F(i, j): score of the alignment of x1 … xi to y1 … yj if xi aligns to yj. G(i, j): score if xi, or yj, aligns to a gap.

36 Needleman-Wunsch with affine gaps Initialization: $F(0,0) = 0$, $F(i,0) = -(d + (i-1) \times e)$, $F(0,j) = -(d + (j-1) \times e)$, $R(0,j) = -\infty$, $C(i,0) = -\infty$. Iteration: $F(i,j) = \max\{\,F(i-1,j-1) + s(x_i,y_j),\; R(i,j),\; C(i,j)\,\}$, $R(i,j) = \max\{\,F(i-1,j) - d,\; R(i-1,j) - e\,\}$, $C(i,j) = \max\{\,F(i,j-1) - d,\; C(i,j-1) - e\,\}$. Termination: same
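The three coupled recurrences can be sketched as below (score only, no traceback). The sketch assumes that d and e are positive penalties that are subtracted, so the border cells carry the negated gap cost; the function name `affine_global_score` is illustrative.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, s, d, e):
    """Global alignment score with gap penalty d + (n - 1) * e for a gap of length n."""
    M, N = len(x), len(y)
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    R = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # x_i aligned to a gap
    C = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # y_j aligned to a gap
    F[0][0] = 0
    for i in range(1, M + 1):
        F[i][0] = -d - (i - 1) * e
    for j in range(1, N + 1):
        F[0][j] = -d - (j - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            R[i][j] = max(F[i - 1][j] - d, R[i - 1][j] - e)  # open vs. extend
            C[i][j] = max(F[i][j - 1] - d, C[i][j - 1] - e)  # open vs. extend
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), R[i][j], C[i][j])
    return F[M][N]

def match_score(a, b):
    # Illustrative scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

print(affine_global_score("AAAA", "AA", match_score, d=3, e=1))
```

In the AAAA / AA example, a single gap of length 2 (cost 3 + 1) beats two separate gaps (cost 3 + 3), so the score is 2 - 4 = -2.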

37 Bounded Dynamic Programming Assume we know that x and y are very similar. Assumption: # gaps(x, y) < k(N) (say N > M). Then $x_i$ aligned to $y_j$ implies $|i - j| < k(N)$. We can align x and y more efficiently: time and space $O(N \times k(N)) \ll O(N^2)$

38 Bounded Dynamic Programming (figure: band of width k(N) around the main diagonal of the matrix with axes x1 … xM and y1 … yN) Initialization: F(i,0), F(0,j) undefined for i, j > k. Iteration: for i = 1 … M, for j = max(1, i - k) … min(N, i + k): $F(i,j) = \max\{\,F(i-1,j-1) + s(x_i,y_j);\; F(i,j-1) - d \text{ if } j > i - k(N);\; F(i-1,j) - d \text{ if } j < i + k(N)\,\}$. Termination: same. Easy to extend to the affine gap case.

39 Significance of an alignment Given an alignment with score S, is it significant? How are the scores of random alignments distributed? (figure: histogram of occurrence vs. score for alignments of unrelated and shuffled sequences, with the region of significant alignments marked)

40 Z-score $Z = (S - \langle S \rangle)/\sigma$, where S = alignment score, $\langle S \rangle$ = average of the scores on a random set of alignments, $\sigma$ = standard deviation of the scores on a random set of alignments. Significance of the alignment: Z < 3 not significant; 3 < Z < 10 probably significant; Z > 10 significant

41 Exercise: Z-score Write a program that takes as input a FASTA file with two sequences and a number N. Compute the score of the global alignment of the two sequences, and the Z-score with respect to N shuffled sequences (generated from the first sequence of the FASTA file) aligned against the original second sequence. $Z = (S - \langle S \rangle)/\sigma$, where S = alignment score, $\langle S \rangle$ = average of the scores on a random set of alignments, $\sigma$ = standard deviation of the scores on a random set of alignments
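A possible skeleton for this exercise is sketched below, under several assumptions: the sequences are passed as strings rather than read from a FASTA file, the global score uses a linear gap penalty, the population standard deviation is used for $\sigma$, and a fixed `seed` makes the shuffles reproducible. All function names are illustrative.

```python
import random
import statistics

def global_score(x, y, s, d):
    """Global alignment score with linear gap penalty d (two-row dynamic programming)."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(prev[j - 1] + s(x[i - 1], y[j - 1]),
                          prev[j] - d,
                          curr[j - 1] - d)
        prev = curr
    return prev[len(y)]

def z_score(x, y, n_shuffles, s, d, seed=0):
    """Z = (S - <S>) / sigma, with <S> and sigma estimated from shuffled copies of x."""
    rng = random.Random(seed)
    observed = global_score(x, y, s, d)
    scores = []
    for _ in range(n_shuffles):
        letters = list(x)
        rng.shuffle(letters)  # same composition as x, random order
        scores.append(global_score("".join(letters), y, s, d))
    mean = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    # If every shuffle scores the same, sigma is 0 and Z is not defined; return inf here.
    return (observed - mean) / sigma if sigma > 0 else float("inf")

def match_score(a, b):
    # Illustrative scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

seq = "ACDEFGHIKLMNPQRSTVWY"
print(z_score(seq, seq, 200, match_score, 1))
```

Shuffling preserves the composition of the first sequence, so the random scores reflect what composition alone can achieve.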

42 Is the Z-score reliable? The Z-score of this alignment is 7.5 over 54 residues. Sequence identity is as high as 25.9%. Yet the sequences have completely different structures: citrate synthase (2cts) vs transthyretin (2paba)

43 E-value Expected number of random alignments obtaining a score greater than or equal to a given score S. Relies on the Extreme Value Distribution: $E = K m n\, e^{-\lambda S}$, where m, n are the lengths of the sequences and K, $\lambda$ are scaling constants. The number of high-scoring random alignments increases when the sequence lengths increase and decreases exponentially when the score increases.

44 E-value Alignment significance The significance of the E-value depends on the size of the database considered. Considering Swiss-Prot: $E > 10^{-1}$ not significant; $10^{-1} > E > 10^{-3}$ probably not significant; $10^{-3} > E > 10^{-8}$ probably significant; $E < 10^{-8}$ significant

45 P-value Probability for random alignments to obtain a score greater than or equal to a given score S. Given the E-value (expected number of alignments), which statistics describe the probability of having a number a of random alignments with score S? Poisson: $P(a) = \frac{E^{a} e^{-E}}{a!}$. What is the probability of finding at least one random alignment with score S? $P(a \ge 1) = 1 - P(0) = 1 - \exp(-E)$
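The relation between score, E-value and P-value can be checked numerically with a few lines. In the sketch below the values of K and $\lambda$ (`lam`) are arbitrary placeholders chosen only for illustration; real values depend on the scoring system.

```python
import math

def e_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """P(at least one random alignment) = 1 - exp(-E), from the Poisson distribution."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)

# K and lam below are made-up placeholder constants, not fitted parameters.
for score in (20, 40, 60):
    e = e_value(score, m=300, n=300, K=0.1, lam=0.3)
    print(score, e, p_value(e))
```

For small E the two quantities nearly coincide (P is approximately E), while for large E the P-value saturates at 1.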

46 Similarity search in databases Given a sequence, search for similar sequences in huge data sets. In principle, the alignments between the query sequence and ALL the sequences in the data sets could be tried. Too many sequences! Heuristic algorithms can be used instead. They are not guaranteed to find the optimal alignment. FASTA BLAST

47 FASTA The query sequence is chopped into words of k-tup characters. Usually k-tup = 2 for proteins, 6 for DNA. ADKLPTLPLRLDPTNMVFGHLRI Words (indexed by position): AD, DK, KL, LP, PT, TL, LP, PL, LR, RL, … The list of indexed words is compiled for each sequence in the data set (subject). The search for correspondences between the words is very fast. The difference between the indexes of a match in the query and in the subject sequence determines its distance from the main diagonal.
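The word indexing and the diagonal bookkeeping described here can be illustrated with the sketch below. It covers only this first step of the heuristic, not the rescoring or banded alignment stages, and the function names and the toy subject sequence are made up for the example.

```python
from collections import Counter, defaultdict

def ktup_index(seq, k):
    """Map every word of length k to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def diagonal_hits(query, subject, k):
    """Count word matches per diagonal, identified by (query index - subject index)."""
    index = ktup_index(subject, k)
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            diagonals[i - j] += 1
    return diagonals

query = "ADKLPTLPLRLDPTNMVFGHLRI"
subject = "XX" + query  # toy subject: the query shifted by two positions
print(diagonal_hits(query, subject, 2).most_common(3))
```

Here all 22 words of the query land on diagonal -2, while repeated words such as LP only add isolated hits on other diagonals.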

48 FASTA Subject Query Many matches along the same diagonal correspond to longer identical segments along the sequences

49 FASTA Subject Query The longest matched diagonals are evaluated with a score matrix (PAM or BLOSUM)

50 FASTA Subject Query Most similar regions on close diagonals are isolated

51 FASTA Subject Query An exact Smith-Waterman alignment is computed on a narrow band around the diagonal endowed with the highest similarity (a 32-residue band is usually adopted)

52 Sequence similarity with FASTA

53 BLAST The sequence data set is indexed as follows: for each possible residue triplet, the occurrences and the positions along the sequences are stored (FORMATDB). AAA AAC AAD ACA

54 BLAST The query sequence is chopped into words, W residues long (usually W=3 for proteins). LSHLPTLPLRLDPTNMVFGHLRI LSH, SHL, HLP, LPT, PTL, TLP, … For each word, all the similar words are generated, using the BLOSUM62 matrix and setting a similarity threshold (usually T = ) LSH ISH MSH VSH LAH LTH LNH
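Generating the list of similar words amounts to enumerating all words of length W and keeping those that score at least T against the query word. The sketch below uses a made-up two-letter alphabet and a toy scoring function instead of BLOSUM62, purely to keep the enumeration readable; `neighborhood` and `toy_score` are hypothetical names.

```python
from itertools import product

def neighborhood(word, alphabet, score, T):
    """All words of the same length whose summed pairwise score with `word` is >= T."""
    return [
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, candidate)) >= T
    ]

def toy_score(a, b):
    # Placeholder for a substitution matrix lookup such as BLOSUM62.
    return 2 if a == b else -1

print(neighborhood("AAB", "AB", toy_score, T=3))
```

With the 20-letter amino acid alphabet and W = 3 the enumeration covers 8000 candidate words per query word, which is still cheap.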

55 BLAST Each word in the list of similar words is looked up in the sequences of the data set by means of the indexes. The match is then extended for as long as the score remains higher than a threshold S.

56 Sequence similarity with BLAST (Basic Local Alignment Search Tool)

57 Alignment of all the retrieved sequences ATTENTION: It is not a multiple sequence alignment

58 Sequence profile (figure: a set of aligned sequences and the profile table derived from them, with one row per alignment position and one column per residue type: A C D E F G H K I L M N P Q R S T V W Y)

59 Usefulness of the sequence profiles Sequence profiles describe the basic features of all the sequences used in the alignment. The most conserved regions and the most frequent mutations at each position are highlighted. Sequence-to-profile alignment: the alignment scores are weighted position by position using the profile, so the same mutation in different positions is scored with different values.

60 Sequence-to-profile alignment The position i along a sequence profile is represented by a 20-valued vector $P_i = (P_i(A), P_i(C), \dots, P_i(Y))$. Given the residue in position j along the sequence to align, $S_j$, the score for aligning $S_j$ to the vector $P_i$ is computed from $P_i$ and M, where M is a score matrix (BLOSUM or PAM). How to score a profile-profile alignment?
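The scoring formula itself is not reproduced on this slide. One common choice, assumed here and not confirmed by the source, is the profile-weighted average of the matrix scores, i.e. the sum over residue types a of $P_i(a) \times M(a, S_j)$. The sketch below implements that assumption with a made-up two-letter matrix; `profile_column_score` is a hypothetical name.

```python
def profile_column_score(column, residue, M):
    """Assumed score: sum over residue types a of P_i(a) * M[a][residue]."""
    return sum(p * M[a][residue] for a, p in column.items())

# Toy profile column and toy score matrix (illustrative values only).
column = {"A": 0.5, "G": 0.5}
M = {"A": {"A": 4, "G": 0},
     "G": {"A": 0, "G": 6}}

print(profile_column_score(column, "A", M))  # 0.5 * 4 + 0.5 * 0
print(profile_column_score(column, "G", M))  # 0.5 * 0 + 0.5 * 6
```

Under this assumption the same residue gets a different score at each profile position, as described on the previous slide.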

61 PSI-BLAST (flowchart: Sequence -> BLAST against the database -> sequence profile -> PSI-BLAST against the database, repeated until convergence)

62 The design of PSI-BLAST (1) PSI-BLAST takes as input a single protein sequence and compares it to a protein database, using the gapped BLAST program. (2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions. (3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly. (4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments. (5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

Alignment ABC. Most slides are modified from Serafim's lectures
