EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

http://people.eecs.ku.edu/~huan/

HMM
- $\Pi$ is a set of states.
- Transition probabilities: $a_{kl} = \Pr(\pi_i = l \mid \pi_{i-1} = k)$, the probability of a transition from state $k$ to state $l$.
- Emission probabilities: $e_k(b) = \Pr(x_i = b \mid \pi_i = k)$, the probability of emitting character $b$ in state $k$.
- HMM topology: a fully connected graph (i.e. a clique) contains too many parameters.
2011/10/22 EECS 730

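HMM example
- States: $\Pi = \{S_i, \text{begin}, \text{end}\}$, $i \in [1, 2]$.
- Transition probabilities $a_{kl} = \Pr(\pi_i = l \mid \pi_{i-1} = k)$:
  $a_{11} = 0.5$, $a_{12} = 0.3$, $a_{1e} = 0.2$
  $a_{21} = 0.7$, $a_{22} = 0.2$, $a_{2e} = 0.1$
- Emission probabilities $e_k(b) = \Pr(x_i = b \mid \pi_i = k)$:
  $S_1$: $e_1(0) = 0.8$, $e_1(1) = 0.2$
  $S_2$: $e_2(0) = 0.1$, $e_2(1) = 0.9$
[Figure: state diagram with Begin, $S_1$, $S_2$, and End.]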

Components of profile HMMs
[Figure: the transition structure of a profile HMM, with match (M), insert (I), and delete (D) states between Begin and End. From Bioinformatics.]

Trivial questions:
- What is the probability that we will observe the state path (path) b, $S_1$, $S_2$, e?
- Given a path b, $S_1$, $S_2$, e, what is the probability that we will observe the sequence 01?
[Figure: the two-state HMM with states $S_1$ and $S_2$ defined earlier.]
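
Both trivial questions reduce to multiplying probabilities along the path. The sketch below uses the transition and emission values of the two-state example. The slides do not give the begin transitions, so the values 0.5 and 0.5 are an assumption made only for illustration, and the function names are hypothetical.

```python
# Two-state example HMM. Begin transitions are NOT given in the slides;
# 0.5 / 0.5 is an assumption made only for this sketch.
TRANS = {
    ("b", "S1"): 0.5, ("b", "S2"): 0.5,  # assumed values
    ("S1", "S1"): 0.5, ("S1", "S2"): 0.3, ("S1", "e"): 0.2,
    ("S2", "S1"): 0.7, ("S2", "S2"): 0.2, ("S2", "e"): 0.1,
}
EMIT = {"S1": {"0": 0.8, "1": 0.2}, "S2": {"0": 0.1, "1": 0.9}}


def path_probability(path):
    """Probability of a state path: the product of its transitions."""
    p = 1.0
    for k, l in zip(path, path[1:]):
        p *= TRANS.get((k, l), 0.0)  # missing transition means probability 0
    return p


def emission_probability(path, seq):
    """Probability of seq given the path: the product of its emissions."""
    emitting = [s for s in path if s in EMIT]  # begin and end emit nothing
    if len(emitting) != len(seq):
        return 0.0
    p = 1.0
    for state, symbol in zip(emitting, seq):
        p *= EMIT[state][symbol]
    return p


path = ["b", "S1", "S2", "e"]
p_path = path_probability(path)                    # a_b1 * a_12 * a_2e
p_seq_given_path = emission_probability(path, "01")  # e_1(0) * e_2(1)
print(p_path, p_seq_given_path)
```

With the assumed begin transition, the path probability is 0.5 × 0.3 × 0.1, and the emission probability of 01 along that path is 0.8 × 0.9.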

Slightly involved questions:
- What is the probability that we will observe the sequence 01 by going through the path b, $S_1$, $S_2$, e?
- What is the probability that we will observe the sequence 01 with the model M?
- What is the most likely path when we observe the sequence 01 from M?
[Figure: the two-state HMM M with states $S_1$ and $S_2$ defined earlier.]
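
The second and third questions can be answered by dynamic programming: summing over all paths gives the total probability of the sequence, and maximizing over paths gives the most likely path. The following is a minimal sketch on the two-state example. The begin transitions are again an assumption (0.5 each, not stated in the slides), and the function names are hypothetical. It assumes a non-empty sequence.

```python
STATES = ["S1", "S2"]
BEGIN = {"S1": 0.5, "S2": 0.5}  # assumed; not given in the slides
TRANS = {"S1": {"S1": 0.5, "S2": 0.3}, "S2": {"S1": 0.7, "S2": 0.2}}
END = {"S1": 0.2, "S2": 0.1}
EMIT = {"S1": {"0": 0.8, "1": 0.2}, "S2": {"0": 0.1, "1": 0.9}}


def forward(seq):
    """Total probability of seq, summed over all state paths."""
    f = {k: BEGIN[k] * EMIT[k][seq[0]] for k in STATES}
    for x in seq[1:]:
        f = {l: EMIT[l][x] * sum(f[k] * TRANS[k][l] for k in STATES)
             for l in STATES}
    return sum(f[k] * END[k] for k in STATES)


def viterbi(seq):
    """Most probable state path for seq and its joint probability."""
    v = {k: BEGIN[k] * EMIT[k][seq[0]] for k in STATES}
    pointers = []
    for x in seq[1:]:
        ptr, nxt = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: v[k] * TRANS[k][l])
            ptr[l] = best
            nxt[l] = EMIT[l][x] * v[best] * TRANS[best][l]
        pointers.append(ptr)
        v = nxt
    last = max(STATES, key=lambda k: v[k] * END[k])
    prob = v[last] * END[last]
    path = [last]
    for ptr in reversed(pointers):  # trace back through the stored pointers
        path.append(ptr[path[-1]])
    path.reverse()
    return path, prob


total = forward("01")
best_path, best_prob = viterbi("01")
print(total, best_path, best_prob)
```

Under the assumed begin transitions, the most likely path for 01 is $S_1$, $S_2$, and its joint probability is one of the four terms that the forward computation sums.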

A hard question:
Given a set of sequences (assuming they are generated by an HMM), how do we estimate the parameters (and the structure) of the related HMM?

Why do we care? Assign membership
- Given an HMM M, built from a protein family P, and a new sequence s, the probability $P(s \mid M)$ tells us how likely it is that s belongs to P and hence has the same function as the proteins in P.
- Question: what if we have two families $P_1$ and $P_2$ and we are not sure which family to assign the sequence to?

Why do we care? Find the alignment
- Given an HMM M, built from a protein family P, and a new sequence s, the most likely path of events $T = \arg\max_{\pi} P(s, \pi \mid M)$ (where $\pi$ is a valid path in M) tells us how we should align s to M.

Why do we care? Build an HMM
- Given a set of protein sequences S, build the HMM that most likely generates S.

Three Important Questions
- How likely is a given sequence? The Forward algorithm.
- What is the most probable path for generating a given sequence? The Viterbi algorithm.
- How can we learn the HMM parameters given a set of sequences? The Forward-Backward (Baum-Welch) algorithm.

Searching with profile HMMs
- Main usage of profile HMMs: detecting potential membership in a family.
- Matching a sequence to the profile HMM: the Viterbi algorithm, based on dynamic programming.
- Maintaining the log-odds ratio compared with the random model R: $P(x \mid R) = \prod_i q_{x_i}$.

Viterbi Algorithm
- The best way to get to E is either:
  - to go to N5 via the best way to it from S and then to E, or
  - to go to N6 via the best way to it from S and then to E, or
  - to go to N7 via the best way to it from S and then to E.
- The best way to get to N5 is either: to go to N2 via the best way to it from S and then to N5, etc.
- In practice: calculate the best route to N1, then N2, N3, N4, N5, N6, N7, and E.
[Figure: directed graph from S to E through nodes N1 to N7.]
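
The recursion above can be sketched as a best-path computation over a directed acyclic graph processed in topological order. The node names follow the slide, but the edges and their probabilities below are hypothetical, since the figure's weights are not recoverable here. The function name `best_path` is also an assumption.

```python
# Hypothetical edge probabilities (the figure's actual weights are not used).
EDGES = {
    "S": {"N1": 0.5, "N2": 0.3, "N3": 0.2},
    "N1": {"N2": 0.4, "N4": 0.6},
    "N2": {"N4": 0.3, "N5": 0.7},
    "N3": {"N4": 0.5, "N6": 0.5},
    "N4": {"N5": 0.2, "N6": 0.3, "N7": 0.5},
    "N5": {"E": 1.0},
    "N6": {"E": 1.0},
    "N7": {"E": 1.0},
}
# Every edge goes from an earlier node to a later node in this order.
ORDER = ["S", "N1", "N2", "N3", "N4", "N5", "N6", "N7", "E"]


def best_path(edges, order, source, sink):
    """Highest-probability path from source to sink in a DAG."""
    best = {source: 1.0}  # best score found so far for each reached node
    prev = {}             # predecessor on the best path, for traceback
    for node in order:
        if node not in best:
            continue  # node is unreachable from the source
        for nxt, weight in edges.get(node, {}).items():
            score = best[node] * weight
            if score > best.get(nxt, 0.0):
                best[nxt] = score
                prev[nxt] = node
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[sink]


route, score = best_path(EDGES, ORDER, "S", "E")
print(route, score)
```

Because each node is finalized before its successors are visited, the best route to E is built from the already computed best routes to N5, N6, and N7, exactly as the slide describes.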

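Viterbi equation
$$V_j^{M}(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max\begin{cases} V_{j-1}^{M}(i-1) + \log a_{M_{j-1}M_j},\\ V_{j-1}^{I}(i-1) + \log a_{I_{j-1}M_j},\\ V_{j-1}^{D}(i-1) + \log a_{D_{j-1}M_j};\end{cases}$$
$$V_j^{I}(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max\begin{cases} V_{j}^{M}(i-1) + \log a_{M_{j}I_j},\\ V_{j}^{I}(i-1) + \log a_{I_{j}I_j},\\ V_{j}^{D}(i-1) + \log a_{D_{j}I_j};\end{cases}$$
$$V_j^{D}(i) = \max\begin{cases} V_{j-1}^{M}(i) + \log a_{M_{j-1}D_j},\\ V_{j-1}^{I}(i) + \log a_{I_{j-1}D_j},\\ V_{j-1}^{D}(i) + \log a_{D_{j-1}D_j}.\end{cases}$$
When insert states emit according to the background distribution $q$, the emission term in $V_j^{I}(i)$ equals 0.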

Example Calculation
[Figure: directed graph from S to E through nodes N1 to N7, with edge and node probabilities.]
- Best path to N1 scores max{0.1 × 0.1} = 0.01, from S.
- Best path to N2 scores max{0.01 × 1, 0.2 × 1} = 0.2, from S.
- Best path to N3 scores max{0.7 × 0.5, 0.01 × 0.9 × 0.5} = 0.35, from S.
- Best path to N4 scores max{0.35 × 0.1, 0.2 × 0.1} = 0.035, from N3.
- And so on.
As with Needleman-Wunsch, we must record the nodes from which the best path came.

HMMs from multiple alignments
- Key idea behind profile HMMs: use the same structure, with different transition and emission probabilities, to capture specific information about each position in the multiple alignment of the whole family.
- The model represents the consensus for the family, not the sequence of any particular member.
HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****
Ten columns from the multiple alignment of seven globin protein sequences. The starred columns are ones that will be treated as matches in the profile HMM.

Multiple alignment by profile HMM training: multiple alignment with a known profile HMM
- Before we estimate a model and a multiple alignment simultaneously, we consider the simpler problem of obtaining a multiple alignment from a known model.
- This arises when we have a multiple alignment and a model of a small representative set of sequences in a family, and we wish to use that model to align a large number of other family members together.

Multiple alignment by profile HMM training: multiple alignment with a known profile HMM
- We know how to align a sequence to a profile HMM: the Viterbi algorithm.
- Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence.
- Residues aligned to the same profile HMM match state are aligned in columns.
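
A minimal sketch of that construction step follows. It assumes each Viterbi path is already available as a list of `(state_kind, model_position, residue)` tuples. That representation, the function name `build_msa`, and the example paths are all hypothetical. Insert residues are written in lowercase and padded with `.` rather than aligned, which is one common display convention.

```python
def build_msa(paths, model_length):
    """Stack per-sequence Viterbi paths into alignment rows.

    Each path is a list of (kind, j, residue) with kind in {"M", "I", "D"}
    and j the model position (1-based for M and D; I_j follows column j).
    """
    rows = []
    for path in paths:
        match = ["-"] * model_length         # delete states stay as gaps
        inserts = [""] * (model_length + 1)  # inserts[j]: residues after column j
        for kind, j, residue in path:
            if kind == "M":
                match[j - 1] = residue
            elif kind == "I":
                inserts[j] += residue.lower()
        rows.append((match, inserts))
    # Width of each insert region is set by the longest insert seen there.
    widths = [max(len(ins[j]) for _, ins in rows)
              for j in range(model_length + 1)]
    lines = []
    for match, inserts in rows:
        parts = [inserts[0].ljust(widths[0], ".")]
        for j in range(1, model_length + 1):
            parts.append(match[j - 1])
            parts.append(inserts[j].ljust(widths[j], "."))
        lines.append("".join(parts))
    return lines


# Hypothetical Viterbi paths for ACSA, AST, and ACCST on a 4-column model.
paths = [
    [("M", 1, "A"), ("M", 2, "C"), ("M", 3, "S"), ("M", 4, "A")],
    [("M", 1, "A"), ("D", 2, None), ("M", 3, "S"), ("M", 4, "T")],
    [("M", 1, "A"), ("M", 2, "C"), ("I", 2, "C"), ("M", 3, "S"), ("M", 4, "T")],
]
msa = build_msa(paths, model_length=4)
print("\n".join(msa))
```

Only match columns are truly aligned across sequences. The padded insert region merely keeps the rows the same length.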

Multiple alignment by profile HMM training: multiple alignment with a known profile HMM
- Important difference from other MSA programs:
  - The Viterbi path through the HMM identifies inserts.
  - The profile HMM does not align inserts.
  - Other multiple alignment algorithms align the whole sequences.
- The HMM doesn't attempt to align residues assigned to insert states.
- The insert-state residues usually represent parts of the sequences that are atypical, unconserved, and not meaningfully alignable.
- This is a biologically realistic view of multiple alignment.

Example: alignment, given a learned HMM, for 3 sequences
Sequences: ACSA, AST, ACCST
[Figure: the best path through the HMM for each sequence.]
MSA so far, after ACSA:
  ACSA
MSA so far, after AST:
  ACSA
  A-ST
MSA so far, after ACCST:
  AC-SA
  A--ST
  ACCST

Another Example
Sequences: ATSA, ACCA, ACAST
[Figure: the best path through the HMM for each sequence.]
MSA so far, after ATSA:
  ATSA
MSA so far, after ACCA:
  AT-SA
  A-CCA
MSA so far, after ACAST:
  AT--SA
  A-C-CA
  AC-AST

HMMs from multiple alignments
Basic profile HMM parameterization
- Aim: making the distribution peak around members of the family.
- Parameters:
  - the probability values: emission probabilities and transition probabilities
  - the length of the model: chosen by heuristics or in a systematic way

Training from an existing alignment
- Start with a predetermined number of states in your HMM.
- For each position in the model, assign a column in the multiple alignment that is relatively conserved.
- Emission probabilities are set according to amino acid counts in columns.
- Transition probabilities are set according to how many sequences make use of a given delete or insert state.
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$
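
As a small illustration of these count-based estimates, the sketch below normalizes residue counts from the first starred column of the globin alignment shown earlier (V, V, V, V, V, F, I). The transition counts are hypothetical, and `normalize` is an illustrative helper name.

```python
from collections import Counter


def normalize(counts):
    """Turn raw counts into probabilities: count / sum of counts."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


# First match column of the seven globin sequences.
column = list("VVVVVFI")
emissions = normalize(Counter(column))  # e_k(a) = E_k(a) / sum_a' E_k(a')

# Hypothetical counts of transitions leaving one match state:
# 6 sequences move to the next match state, 1 uses the delete state.
transition_counts = {"M": 6, "D": 1, "I": 0}
transitions = normalize(transition_counts)  # a_kl = A_kl / sum_l' A_kl'

print(emissions)
print(transitions)
```

Any residue or transition that was never observed gets probability zero under this estimate, which matters when the training alignment is small.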

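More on estimation of probabilities (1)
- Maximum likelihood (ML) estimation, given the observed frequency $c_{ja}$ of residue $a$ in position $j$:
$$e_{M_j}(a) = \frac{c_{ja}}{\sum_{a'} c_{ja'}}$$
- Problem of ML estimation: what if some cases are never observed? A residue with $c_{ja} = 0$ gets probability zero.
- This is especially serious when the observed examples are few.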

More on estimation of probabilities (2)
- Simple pseudocounts, where $q_a$ is the background distribution and $A$ is a weight factor:
$$e_{M_j}(a) = \frac{c_{ja} + A q_a}{A + \sum_{a'} c_{ja'}}$$
- Laplace's rule: $A q_a = 1$ for every residue $a$.
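
The pseudocount formula can be sketched in a few lines. The example below assumes a uniform background over the 20 amino acids and weight A = 20, which gives Laplace's rule (one pseudocount per residue). The function name and the example counts are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudocount_emissions(counts, background, weight):
    """e(a) = (c_a + A * q_a) / (A + sum of all counts)."""
    total = sum(counts.get(a, 0) for a in background)
    return {
        a: (counts.get(a, 0) + weight * q) / (total + weight)
        for a, q in background.items()
    }


# Uniform background q_a = 1/20; with A = 20 this gives A * q_a = 1.
uniform = {a: 1 / 20 for a in AMINO_ACIDS}
counts = {"V": 5, "F": 1, "I": 1}  # example column counts
laplace = pseudocount_emissions(counts, uniform, weight=20)

print(laplace["V"], laplace["A"])
```

Residues that never appear in the column, such as A here, now receive a small non-zero probability, and the estimates still sum to one.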

A simple example
- Six positions were chosen for the model.
- The highlighted area was selected to be modeled by an insert state because of its variability.

Profile HMM training from unaligned sequences
A harder problem: estimating both a model and a multiple alignment from initially unaligned sequences.
- Initialization: choose the length of the profile HMM and initialize the parameters.
- Training: estimate the model using the Baum-Welch algorithm or the Viterbi alternative.
- Multiple alignment: align all sequences to the final model using the Viterbi algorithm and build a multiple alignment as described in the previous section.

Profile HMM training from unaligned sequences
Initial model
- The only decision that must be made in choosing an initial structure for Baum-Welch estimation is the length of the model, M.
- A commonly used rule is to set M to the average length of the training sequences.
- We need some randomness in the initial parameters to avoid local maxima.

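Find appropriate parameters
Baum-Welch algorithm
- An instance of the EM (Expectation-Maximization) family of algorithms.
- An update never decreases the likelihood $P(X \mid \theta)$, where $X$ is a set of sequences.
Flow of the B-W algorithm:
1. Set the initial parameters at random.
2. Update the parameters.
3. Is the increase of the likelihood smaller than $\varepsilon$? If no, go back to step 2. If yes, output the parameters.
[Figure: likelihood $P(X \mid \theta)$ plotted against the number of updates.]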

Find appropriate parameters
The Viterbi alternative
- Start with a model whose length matches the average length of the sequences and with random emission and transition probabilities.
- Align all the sequences to the model.
- Use the alignment to alter the emission and transition probabilities.
- Repeat. Continue until the model stops changing.

Multiple alignment by profile HMM training
Avoiding local maxima
- The Baum-Welch algorithm is only guaranteed to find a LOCAL maximum.
- Models are usually quite long, and there are many opportunities to get stuck in a wrong solution.
- Multidimensional dynamic programming finds global optima, but is not practical.
Solution
- Start again many times from different initial models.
- Use some form of stochastic search algorithm, e.g. simulated annealing.

Profile HMM training from unaligned sequences
Advantages:
- You take full advantage of the expressiveness of your HMM.
- You might not have a multiple alignment on hand.
Disadvantages:
- HMM training methods are local optimizers, so you may not get the best alignment or the best model unless you're very careful.
- This can be alleviated by starting from a logical model instead of a random one.

Profile HMM Summary
Advantages:
- Very expressive profiling method.
- Transparent method: you can view and interpret the model produced.
- A consistent theory behind gap and insertion scores.
- Very effective at detecting remote homologs.
Disadvantages:
- Slow: a full search on a database of 400,000 sequences can take 15 hours.
- Have to avoid over-fitting and locally optimal models.