MINIMUM DESCRIPTION LENGTH BASED PROTEIN SECONDARY STRUCTURE PREDICTION

Size: px

Start display at page:

Download "MINIMUM DESCRIPTION LENGTH BASED PROTEIN SECONDARY STRUCTURE PREDICTION"

Harry Tyler
5 years ago
Views:

1 MINIMUM DESCRIPTION LENGTH BASED PROTEIN SECONDARY STRUCTURE PREDICTION Andrea Hategan and Ioan Tabus Insttute of Sgnal Processng, Tampere Unversty of Technology P.O. Box 553, FIN Tampere, Fnland phone: +(358) , fax: +(358) , emal: web: hategan, tabus ABSTRACT Ths paper ntroduces a new algorthm for predctng the secondary structure of a proten based on the proten s prmary structure,.e. ts amno acd sequence. The problem conssts n fndng the segmentaton of the ntal amno acd sequence, where each segment carres the label of a secondary structure, e.g., helx, strand, and col. Our algorthm s dfferent from other exstng probablstc nference algorthms n that t uses probablstc models sutable for drectly encodng the jont nformaton represented by the par (amno acd sequence, secondary structure labels), and chooses as wnner the secondary structure sequence provdng the mnmum representaton, or descrpton length, n lne wth the mnmum descrpton length prncple. An addtonal beneft of our approach s that we provde not only a secondary structure predcton tool, but also a tool that s able to compress n an effcent manner the jont sequences that defne the prmary and secondary structure nformaton n protens. The prelmnary results obtaned for predcton and compresson show a good performance, whch s better n certan aspects than that of comparable algorthms. 1. INTRODUCTION Protens are essental molecules to sustan lfe n all lvng organsms. The process by whch a proten s foldng n ts three dmensonal shape s a prerequste for a proten to be able to perform ts bologcal functon. Because many dseases are the consequence of some protens falure to adopt ther 3-D functonal shape (.e. proten msfoldng [1]) t s mportant to understand the rules by whch the protens fold n the three dmensonal shape. The three dmensonal shape of a proten, known as the tertary structure, s nduced by ts prmary structure,.e. the amno acd sequence, and the envronmental condton. It s a lot easer to expermentally determne the prmary structure of a proten than t s to determne ts tertary structure, and thus the methods for predctng the tertary structure from the prmary structure are a current topc of research. Presently, there s a huge gap between the number of protens for whch the prmary structure s known and the number of protens for whch the tertary structure, and thus ther functon, s known. As a precondton for foldng n the three dmensonal shape, protens form some local conformatons, e.g. alphahelces and beta-strands, called the secondary structure, Ths work was supported by the Academy of Fnland (applcaton number , Fnnsh Programme for Centers of Excellence n Research ) and the Graduate school n Electroncs, Telecommuncaton and Automaton (GETA). whch fnally are essental for acqurng the three dmensonal shape. Because these local conformatons mpose geometrcal constrants on the three dmensonal shape of a proten, the predcton of the secondary structure from the amno acd sequence s also an mportant stage n the process of predctng the tertary structure. In ths paper the am s to predct such local conformatons from the amno acd sequence,.e. for each amno acd n a gven proten sequence to predct a class label that specfes what s the type of secondary structure to whch t belongs. The feld of proten secondary structure predcton from amno acd sequences has developed durng four decades resultng n a large number of methods whch can be classfed n three generatons [2]. The frst generaton methods, that appeared n 1960s and 1970s, evaluated the lkelhood of dfferent amno acds to form dfferent local conformatons. The second generaton methods were domnatng untl early 1990s and used the lkelhood of groups of adjacent amno acds to form dfferent secondary structures. The thrd generaton methods, of the last decade, addtonally consder the evolutonary nformaton contaned n multple algnments for refnng the predcton of the secondary structure. A multple algnment can be bult between several homologous protens,.e. protens that have smlar known functon, but have dfferent amno acd sequences. Usng such an algnment, one can buld dstrbuton profles for each poston n the sequence, thus provdng a stronger dscrmnatve nformaton, by accountng for the varablty whch does not affect the functon of a proten (thus beng also less mportant for ts 3-D shape). Varous predctors can be bult to use the dstrbuton profles, e.g., based on two layer neural networks. Although the accuracy of the methods usng evolutonary nformaton s hgher than that of the methods that are usng only the amno acd sequence, the predcton of the secondary structure from amno acd sequence alone s stll an mportant task because there are many orphan protens [3] for whch one needs to determne the secondary structure. For an orphan proten the evolutonary nformaton s not avalable, because ts amno acd sequence s not sgnfcantly smlar to that of protens wth known secondary structure and functon. The most nvestgated approach for proten secondary structure predcton s to consder a sldng wndow of typcally 15 amno acds and to predct the class label of the central amno acd knowng only the neghborng amno acds. The man problem of ths approach s that t cannot capture amno acd correlatons at a greater dstance than the wndow length. For ths reason, sldng wndow methods have a poor predcton accuracy for beta strands, whch have to form hy-

2 MWMPPRPEEVARKLRRLGFVERMAKGGHRLYTHPDGR LLLLLLHHHHHHHHHHHLLLEEEEELLEEEEELLLLL k=7 S={L,H,L,E,L,E,L} L ={6,11,3,5,2,5,5} Fgure 1: Example of a proten sequence wth n = 37 amno acds and ts secondary structure annotaton drogen bonds wth other beta strands stuated at a larger dstance than the length of the sldng wndow, n order to create beta sheets n the foldng process. The problem of secondary structure predcton can be formulated also as a segmentaton problem, where one starts from the prmary structure sequence and determnes: the number of segments, the class label of each segment and the length of each segment. Ths approach has been already nvestgated n the BSPSS [4] and IPSSP [3] algorthms. In these algorthms the goal s to fnd the best segmentaton by maxmzng the posteror probablty of the segmentaton gven the proten sequence, va Bayesan nference. In our algorthm, called mdlpssp (mnmum descrpton length proten secondary structure predcton), the best segmentaton s chosen va MDL prncple [5], by selectng the hypothess,.e. the segmentaton, that yelds the shortest descrpton length of the data and of the model parameters. The probablstc models and the searchng method used n our algorthm are dfferent than those used n [4] and [3], beng addtonally sutable for real encodng of the jont prmary and secondary structure nformaton. The paper s organzed as follows: n the next secton the problem of proten secondary structure predcton s formulated; the thrd secton presents the man features, modellng, codng, and searchng, of the mdlpssp algorthm. The expermental results are presented n the fourth secton, whle the conclusons are drawn n the last secton. 2. PROBLEM STATEMENT We need to predct the secondary structure annotaton of a gven proten sequence X n = {x 1...x n },.e. to fnd the trplet (k,s,l ) such that (k,s,l ) = mn (k,s,l ) CL(X n k,s,l ) (1) where k s the number of secondary structure segments, S = {s 1...s k } specfes the secondary structure type for each segment, L = {l 1...l k } specfes the length of each segment such that k =1 l = n, and CL(X n k,s,l ) s the cost of encodng the proten sequence X n when the secondary structure annotaton s gven by (k,s,l ). An example llustratng the notatons s gven n Fgure 1 where the amno acd sequence of a proten s shown n the frst row and ts secondary structure annotaton n the second row. The amno acd sequence represents the nput data of the algorthm. The letters n the second row specfy the secondary structure class label for each amno acd n the frst row. The secondary structure nformaton can be represented by the trplet (k,s,l ) and ths trplet s the output of the algorthm. In order to be able to compute the codelength of a proten sequence X n gven ts secondary structure annotaton (k,s,l ) we have to choose a set of probablstc models, that wll be denoted by M. In order to deal wth sequentally computable probabltes, we are gong to assume a parametrc frst order dependency, p(s s 1,M ), for the class labels, and a parametrc dependency of the amno acd probablty, gven ther class labels, such that the crteron n (1) wll decouple as: CL(X n k,s,l ;M ) = k =1 CL(X l s,s 1 ;M ) (2) For a gven (hypotheszed) secondary structure segmentaton, one can use a procedure to compute the cost of the secondary structure segments CL(X l s,s 1 ;M ), and add all results nto the overall cost (2). Our algorthm wll conssts of a search through the space of all segmentatons n order to fnd the one havng the mnmum assocated cost (1). The adopted model makes the search feasble by usng the dynamc programmng method. The next secton presents the set of models M used, the procedure to compute the cost of encodng a secondary structure segment CL(X l s,s 1 ;M ) and the searchng algorthm for the best segmentaton. 3. THE MDLPSSP ALGORITHM As n any machne learnng algorthm we wll defne the procedures for the two steps of tranng and testng. In the tranng step, the algorthm adjusts the parameters and structure wthn the adopted classes of parametrc models, n order to acqure knowledge about the dependences expected between secondary and prmary structures. In the testng step, the algorthm s provded wth some new proten prmary sequences, dfferent than the ones gven n the tranng step, and s asked to provde the segmentaton, or secondary structure, whch s subsequently assessed aganst the true one. For the proten secondary structure predcton problem the tranng step mples choosng the structure of the parametrc probablstc models M and adjustng the parameter values of these models based on the average codelength resulted for the observed segments of a gven secondary structure class. The ssues concernng the tranng step are dscussed n subsecton 3.1. In the testng step, the secondary structure of a new proten s predcted usng the models developed n the tranng step, through a searchng algorthm descrbed n 3.2. The predcted secondary structure s then compared wth the true secondary structure annotaton n order to assess the performance of the algorthm. 3.1 Codng and modellng We start by descrbng how the codelength of a gven secondary structure segment CL(X l s,s 1 ;M ) s calculated. In a typcal data compresson framework the encoder wants to send some data to the decoder by creatng a btstream that has to contan all the nformaton needed for the decoder to be able to recover the orgnal data. When the encoder has to send one secondary structure segment X l to a decoder, the encoder uses a set of models M to create a btstream that encodes the followng nformaton: the type s, the length l, and the amno acd sequence X l = {x 1...x l }. Usng the btstream and the same set of models M, the decoder s able to decode from the btstream all the nformaton needed to recover the sent secondary structure segment. The length of the resulted btstream s the cost of encodng the gven secondary structure segment usng the set of models M,.e. CL(X l s,s 1 ;M ).

3 Let M = {M S,M H,M E,M L } be the set of followng models: M S specfes the transton between consecutve secondary structure labels; M H,M E,M L are condtonal (tree) models for the amno acd beng n one of the three secondary structure types, helx H, strand E, and col L, respectvely. Then, usng the arthmetc codes [6], one can encode n the btstream all nformaton, wth the cost of segment gven by: CL(X l l +1 s,s 1 ;M ) = log 2 P(s s 1 ;M S ) log 2 P(x j c j ;M s ) (3) where M S s a frst order Markov model that gves the probablty of observng a segment of type s followng a segment of type s 1 and M s s a Tree Machne (TM) [7] that gves the probablty of observng the amno acd x j n the context c j n a segment of type s. The summaton n (3) has upper lmt l + 1 snce the encoder has to notfy when a segment ends, by sendng the STOP symbol. Thus, the models M H,E,L nclude along wth the standard amno acds the STOP symbol yeldng an alphabet A wth n a = A = 21 symbols. A TM s bult for each secondary structure type. Frst, a maxmal tree machne (MTM) s bult from a tranng set for each type and then the maxmal trees have to be pruned n order to dscrmnate as much as possble between the three secondary structure types. Encodng wth a TM, mples selectng the context,.e. a certan node n the tree, n whch a symbol s encoded. All these ssues are dscussed next. The tree machnes we use here are data structures smlar to those ntroduced by Rssanen n [7] that keep a sorted lst of pars (context, symbol). The contexts can be seen as nodes n a tree, whle the symbols formng a context defne a path from the root to the node. Each node n the tree stores a probablty dstrbutons of the symbols that were seen at that node. In our applcaton the MTM s bult from a large set of short segments, where each segment has a label, H,E, or L. The constructon of the MTM s explaned by the followng example and t s llustrated n Fgure 2. Suppose that the tranng set s composed of two segments: aabacb and abac. Frst, we pad each segment wth the extra symbol, to mark ther begnnng and endng. Only for ths example, we assume that the alphabet s A = {a,b,c, }. Then, for each symbol n all segments we select the longest context n whch t appears,.e. the prefx of each segment up to that symbol (the left table n Fgure 2). The resulted prefxes are sorted from rght to left n lexcal order. The contexts that wll defne the MTM are selected such that the prefx unquely dentfes the assocated symbol or the begnnng of a segment has been reached (the rght table n Fgure 2). The resulted contexts are arranged n a tree such that each context defnes a leaf. When a context s added to the tree, the tree s clmbed startng from root and followng the branches defned by context symbols from rght to left. The frequency of the symbol pared wth the added context s ncreased n all nodes (contexts) encountered on the path from root to the leaf defned by the context. For example, for the par ( aba, c), the resulted context s ba and the assocated symbol s c. Ths par s added n the MTM as follows: ncrease the count for c n the root and then take the branch a. At the arrved node ncrease the count for c and take the branch b. At the arrved leaf ncrease the count for c. Snce a context defnes a node j=1 aabacb abac a a aa b aab a aaba c aabac b aabacb a b ab a aba c abac prefxes aabacb abac a a a b aa b aba c aaba c ab a aab a aabacb abac aabac b contexts Fgure 2: Maxmal Tree Machne buldng n the tree and a node defnes a context, the two terms can be used nterchangeably. In the encodng process, the context for the current symbol x j s selected by clmbng the tree startng from the root and then takng the branches defned by the prevous symbols x j 1,x j 2,... untl a leaf s reached or untl at the arrved node r = x j 1 x j 2...x j t there s no branch wth the label x j t 1. If the MTM s used to encode the tranng set, almost all symbols, except those at the begnnng of segments, wll be encoded usng the dstrbutons n leaves, where the dstrbutons are more relevant and gve better estmates for the probablty of the next symbol. If the MTM s to be used for a data set dfferent than the tranng set, the so called zero-frequency problem wll occur. Ths problem deals wth the stuaton when at the arrved node r, the probablty of the current symbol x j s zero. Ths means that n the tranng phase the current symbol x j has never been observed n the context r. Because our goal s to use the MTM for data dfferent than the tranng set, we have to assgn probabltes for all symbols n the alphabet at each node. Ths can be acheved by ntroducng the probablty for the escape symbol [8], P esc. In the encodng process, f n the selected context the probablty of the current symbol s zero, the encoder frst sends the escape symbol to nform the decoder that the current symbols has not been seen n the current context. After ths, the encoder specfes whch of the unseen symbols n the current context, actually s sent, usng an equprobable dstrbuton over the unseen symbols. By ntroducng the probablty for the escape symbol, the collected dstrbutons at each node wll be adjusted to nclude also ths probablty. Thus, f the probablty of a symbol x j n the context r s P(x j r), the adjusted probablty wll be ˆP(x j r) = (1 P esc )P(x j r). Although the longer contexts are more relevant and provde better probablty estmates for the probablty of the next symbol, the problem s that these longer contexts also appear less often than the shorter ones. Thus, the statstcs kept n longer context accumulate slower. Even f the zerofrequency problem has been solved by ntroducng the escape symbol, when encodng a new data set we would lke to avod long contexts that are ftted to the tranng set, because here t s very lkely that the current symbol wll be encoded va the escape event, whch yelds a longer codelength than the one n a smaller context, where the symbol s more lkely to be observed. Then, the goal s to prune the MTM such that to get rde of those long contexts, (over-)ftted to the tranng set, where encodng s neffcent. After one maxmal tree MTM s has been created for each

4 of the three secondary types, s = {H,E,L}, the goal of prunng s to carve such trees, whch wll dscrmnate well between the three classes. From each MTM s, we have created a set of pruned tree machnes TM d s,d = 1 : D s that keep only the nodes from root untl depth level d, where D s s the maxmum level reached over the tranng set. The problem now s to fnd the optmum level d s yeldng the most dscrmnatve collecton of three trees. The resultng trees, TM d s s, and the dstrbuton collected at ther nodes represent the abstract models denoted earler M s. The calbraton process was carred out on a calbraton set, that s dfferent from both the tranng and testng sets. Denotng by C s the calbraton set contanng all amno acd segments for type s, the best value of ds was selected by the mnmzaton : ds = mn{cl(c s TMs j j )+ [CL(C s TMs j ) CL(C q TMs j )]} q s (4) Wth all the nformaton descrbed, we are able now to compute the codelength of a gven secondary structure segment (3) and further to compute the codelength of a gven proten sequence X n gven ts secondary structure annotaton (k,s,l ) (2). 3.2 Searchng In the prevous sectons we descrbed how to encode a gven proten sequence X n when ts secondary structure annotaton (k,s,l ) s also gven, based on the set of proposed models M. Next we move to the problem of actually fndng the secondary structure annotaton (k,s,l ) such that the resulted codelength s mnmzed (1). The exhaustve search through all possble segmentatons of the gven proten sequence for choosng the mnmzer n (1) s too expensve or even nfeasble for most of the proten sequences, whch have commonly hundreds of amno acds. For ths reason we resort to dynamc programmng, whch can be appled here due to the separablty of the crteron n (2). We have to defne the states and ther assocated costs, as well as the costs for a transton from one state to another. For mnmzng (2), a state can be defned as Z( j,k,s), where j represents the poston of the current amno acd x j n the gven proten sequence X n, k represents the number of segments up to poston j, and s represents the secondary structure label for the segment that ends at poston j. Each such state has assocated a cost that represents the codelength of the proten sequence X j, gven that there are k segments and the last one has the label s. From the state Z( j,k,s), the algorthm can make a transton to the new state Z(t,k + 1, p), wth the transton cost gven by CL(X j+1:t s, p;m p ). For each state we mark the state from whch we arrved wth the mnmum overall cost (cost of startng state plus cost of transton). For j = n, there are (n k mn ) 3 states, where k mn = n/l max and L max s the maxmum allowed segment. When arrvng at the states wth j = n, we select the one wth the mnmum cost and the best segmentaton s found by followng the ponters kept at each state, untl the state Z(0,0,0) s reached. 4. EXPERIMENTAL RESULTS In order to assess the performance of any predcton algorthm, the tranng set and the testng set should be dstnct. For the proten secondary structure predcton problem, the requrements are tradtonally more strngent,.e. the two data set should not contan sequences that share sgnfcant sequence smlarty. Ths condton s necessary snce keepng the sequences that share a certan degree of sequence smlarty (lkely to be homologous protens) wll artfcally lower the reported predcton errors. In our experments, we have used three data sets. For tranng, we have used the PSIPRED data set from for the calbraton process we have used the EVA data set from ftp://cubc.boc.columba.edu/pub/eva/unque lst.txt and for testng we have used the CASP6 targets data set vew.cg?loc=predctoncenter.org;page=casp6/. The tranng and testng sets are the same as n [3] for comparson purposes. All the data sets contan the followng nformaton for each proten: the sequence dentfer, the amno acd sequence and the secondary structure annotaton as defned by DSSP [9]. DSSP defnes eght secondary structure classes (H,I,G,E,B,S,T,-), but usually the eght classes are mapped to three classes as suggested n [10]: (H,G) H, (E,B) E and (I,S,T,-) L, whch we adopt also here. The performance of the predcton methods were compared n terms of per state predcton accuracy defned as Q (%) = N pred /N obs 100, where = {H,E,L,3} wth = 3 meanng the predcton over all three states, N obs s the number of observed amno acds n the secondary structure type and N pred s the number of amno acds correctly predcted n state. The performance of the new algorthm when workng as a compresson algorthm,.e. when the secondary structure s gven, was evaluated n terms of needed bts per symbol. Here, symbol s taken as the par: amno acd and secondary structure label. In ths case, one needs log log 2 3 = 5.9 bts per symbol on average to encode a proten sequence and ts secondary structure nformaton. Snce the secondary structure sequence s hghly correlated, we also consder two smple alternatves. In the frst one each amno acd s encoded usng log 2 20 bts per amno acd, whle for each secondary structure segment the class label s encoded usng log 2 3 and the length s encoded by Elas Gamma codes for ntegers. The second method, transforms the secondary structure annotaton n a strng of 0,1 and 2. The 0 s used to mark that the class label does not change, whle 1 and 2 are used to specfy where the class label changes and whch s the new label. The resulted strng s encoded usng the arthmetc codes [6]. The frst method s denoted Clear1 n Table 2, whle the second s denoted Clear2. Table 1 presents the results obtaned n the calbraton process on the EVA data set. The table s splt n three parts, one for each secondary structure type. The second column presents the average codelength when the segments from one class have been encoded wth dfferent pruned trees resulted from the maxmal trees traned on the same class (HH,EE,LL). The thrd and the forth column represent the average codelength when the segments from one class are encoded usng trees traned on dfferent classes. The expresson n (4) tells to choose the level that yelds the mnmum value n the last column, for each secondary structure type.

5 d H HH HE HL HH + (HH-HE) + (HH-HL) d E EE EH EL EE+(EE-EH)+(EE-EL) d L LL LH LE LL+(LL-LH)+(LL-LE) Table 1: Calbraton results on EVA set Algorthm Q H (%) Q E (%) Q L (%) Q 3 (%) PSIPRED BSPSS IPSSP mdlpssp Table 2: Predcton accuracy results: per state accuracy. Table 2 presents the results for the predcton accuracy. For the mdlpssp algorthm, the best results were obtaned when the pruned trees had: 3 levels for H, 2 levels for E and 2 levels for L. Note that from Table 1, ths corresponds to the best results when the segments of a certan type are encoded wth the tree traned on the same secondary structure type (column 2). The results n Table 2 n the frst three lnes, were taken from [3]. PSIPRED s a predcton algorthm that uses addtonally evolutonary nformaton, but for the results n ths table t was tested under sngle-sequence condton. The new algorthm mdlppss has the best performance for the helx class. The results are prelmnary snce we contnue to experment wth more refned prunng technques. On average, the predcton takes 22 seconds for one sequence on the CASP6 test set. Table 3 presents the results for the mdlpssp algorthm, when t works as a compresson algorthm for a proten sequence and ts secondary structure nformaton. The results are better for both data sets when compared wth two methods presented earler for the clear representaton. The column mc represents the cost for the set of models M f these should also be sent to the decoder. Ths s compulsory for the PSIPRED data set, because the models were traned on the same set. For the EVA set, t can be assumed that the models are generc, known to both the encoder and the decoder. The column sc shows the average codelength obtaned for encodng a proten sequence and ts secondary Data #seqs mc sc tc Clear 1 Clear 2 EVA PSIPRED Table 3: Compresson results: bts per symbol (amno acd and secondary structure label). structure annotaton. The column denoted by tc represents the total cost, for the model and the data. The last two methods show the average codelength obtaned for the two clear representatons. On average, the compresson takes 0.08 seconds for one sequence on the EVA set and 0.07 seconds for a sequence n the PSIPRED data set. The compresson results are compared wth the clear representaton because, to our knowledge, there s no other avalable algorthm for jontly compressng ndvdual protens and ther secondary structure nformaton. In a prevous work, we took advantages of the secondary structure annotaton n a compresson algorthm for amno acd sequences, [11], but n that algorthm we targeted compresson of full proteomes, whch are the collecton of all proten sequences n a gven organsm. The proteomes have length n the range of hundred thousand amno acds, and the approxmate repettons have a key role for achevng compresson, whch cannot be exploted n the current applcaton, so the method from [11] cannot be appled to the present dataset. 5. CONCLUSIONS Ths paper ntroduces a new algorthm, mdlpssp, useful n two dstnct applcatons: proten secondary structure predcton and jont compresson of prmary and secondary structure of protens. The prelmnary predcton results show better performance only for one of the classes, but the current prunng technque s stll not well adapted for predcton and a more refned prunng may lead to a better performance for all classes. The algorthm can work also as a compresson algorthm for ndvdual proten sequences and ther secondary structure nformaton, beng the frst such algorthm for these types of data. The compresson performance s sgnfcantly better when compared to a smple n clear representaton. REFERENCES [1] C.M. Dobson, Proten foldng and msfoldng, Nature, vol. 426, pp , Dec [2] B. Rost, Revew: proten secondary structure predcton contnues to rse, Journal of Structural Bology, vol. 134, pp , [3] Z. Aydn, Y. Altunbasak, and M. Borodovsky, Proten secondary structure predcton for a sngle sequence usng hdden sem-markov models, BMC Bonformatcs, vol. 7, no. 178, [4] S.C. Schmdler, J.S. Lu, D.L. Brutlag, Bayesan segmentaton of proten secondary structure, Journal of Computatonal Bology, vol. 7, no. 1/2, pp , [5] J. Rssanen, Modellng by the shortest data descrpton, Automatca, vol. 14, pp , [6] J. Rssanen, Generalzed Kraft nequalty and arthmetc codng, IBM Journal of Research and Development, vol. 20, no. 3, pp , [7] J. Rssanen, A unversal data compresson system, IEEE Trans. on Informaton Theory, vol. IT-29, no. 5, pp , [8] J. Rssanen, A lossless data compresson system, US Patent, no , [9] W. Kabsch, C. Sander, Dctonary of proten secondary structure: pattern recognton of hydrogen-bonded and geometrcal features, Bopolymers, vol. 22, pp , [10] B. Rost, C. Sander, Predcton of proten secondary structure at better than 70% accuracy, Journal of Molecular Bology, vol. 232, , [11] A. Hategan, I. Tabus, Jontly encodng proten sequences and ther secondary structre annotaton, n Proc. GENSIPS 2007, Tuusula, Fnland, June

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions Applcaton of Maxmum Entropy Markov Models on the Proten Secondary Structure Predctons Yohan Km Department of Chemstry and Bochemstry Unversty of Calforna, San Dego La Jolla, CA 92093 ykm@ucsd.edu Abstract