A Hidden Markov Model Variant for Sequence Classification

Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Sam Blasiak and Huzefa Rangwala
Computer Science, George Mason University
sblasiak@gmu.edu, rangwala@cs.gmu.edu
Funding: NSF III

Abstract

Sequence classification is central to many practical problems within machine learning. Distance metrics between arbitrary pairs of sequences can be hard to define because sequences can vary in length and the information contained in the order of sequence elements is lost when standard metrics such as Euclidean distance are applied. We present a scheme that employs a Hidden Markov Model variant to produce a set of fixed-length description vectors from a set of sequences. We then define three inference algorithms, a Baum-Welch variant, a Gibbs Sampling algorithm, and a variational algorithm, to infer model parameters. Finally, we show experimentally that the fixed-length representation produced by these inference methods is useful for classifying sequences of amino acids into structural classes.

1 Introduction

The need to operate on sequence data is prevalent in a variety of real-world applications, ranging from protein/DNA classification, speech recognition, and intrusion detection to text classification. Sequence data can be distinguished from the more typical vector representation in that the length of sequences within a dataset can vary and that the order of symbols within a sequence carries meaning. For sequence classification, a variety of strategies, depending on the problem type, can be used to map sequences to a representation that can be handled by traditional classifiers. A simple technique involves selecting a fixed number of elements from the sequence and then using those elements as a fixed-length vector in the classification engine. In another technique, a small subsequence length, l, is selected, and a size M^l vector is constructed containing the counts of all length-l subsequences from the original sequence. This vector can then be used for classification [Leslie et al., 2002]. A third method for classifying sequence data requires only a positive definite mapping defined over pairs of sequences rather than any direct mapping of sequences to vectors. This strategy, known as the kernel trick, is often used in conjunction with support vector machines (SVMs) and allows for a wide variety of sequence similarity measurements to be employed.

Hidden Markov Models (HMMs) [Rabiner and Juang, 1986; Eddy, 1998] have a rich history in sequence data modeling (in speech recognition and bioinformatics applications) for the purposes of classification, segmentation, and clustering. HMMs' success is based on the convenience of their simplifying assumptions. The space of probable sequences is constrained by assuming only pairwise dependencies over hidden states. Pairwise dependencies also allow for a class of efficient inference algorithms whose critical steps build on the Forward-Backward algorithm [Rabiner and Juang, 1986].

We present an HMM variant over a set of sequences, with one transition matrix per sequence, as a novel alternative for handling sequence data. After training, the per-sequence transition matrices of the HMM variant are used as fixed-length vector representations for each associated sequence. The HMM variant is also similar to a number of topic models, and we describe it in the context of Latent Dirichlet Allocation [Blei et al., 2003]. We then describe three methods to infer the parameters of our HMM variant, explore connections between these methods, and provide rationale for the classification behavior of the parameters derived through each.
We perform a comprehensive set of experiments, evaluating the performance of our method in conjunction with support vector machines, to classify sequences of amino acids into structural classes (the fold recognition and remote homology detection problems [Rangwala and Karypis, 2006]). The combination of these methods, their interpretations, and their connections to prior work constitutes a new twist on classic ways of understanding sequence data that we believe is valuable to anyone approaching a sequence classification task.

2 Problem Statement

Given a set of N sequences, we would like to find a set of fixed-length vectors, A_{1...N}, that, when used as input to a function f(A), maximize the probability of reconstructing the original set of sequences.

Under our scheme, f(A) is a Hidden Markov Model variant with one transition matrix, A_n, assigned to each sequence, and a single emissions matrix, B, and start probability vector, a, for the entire set of sequences. By maximizing the likelihood of the set of sequences under the HMM variant model, we will also find the set of transition matrices that best represent our set of sequences. We further postulate that this maximum likelihood representation will achieve good classification results if each sequence is later associated with a meaningful label.

2.1 Model Description

We define a Hidden Markov Model variant that represents a set of sequences. Each sequence is associated with a separate transition matrix, while the emission matrix and initial state transition vector are shared across all sequences. We use the value of each transition matrix as a fixed-length representation of the sequence. We define the parameters and notation for the model in Table 1.

Parameter   Description
N           the number of sequences
T_n         the length of sequence n
K           the number of hidden symbols
M           the number of observed symbols
a_i         start state probabilities, where i indexes the value of the first hidden state
A_{nij}     transition probabilities, where n is an index of a training sequence, i the originating hidden state, and j the destination hidden state
B_{im}      emission probabilities, where i indicates the hidden state and m the observed symbol associated with the hidden state
z_{nt}      the hidden state at position t in sequence n
x_{nt}      the observed symbol at position t in sequence n

Table 1: HMM variant model parameters

The joint probability of the model is shown below:

(1)  p(x, z | a, A, B) = ∏_{n=1}^{N} a_{z_{n1}} B_{z_{n1},x_{n1}} ∏_{t=2}^{T_n} A_{n,z_{n,t-1},z_{nt}} B_{z_{nt},x_{nt}}

This differs from the standard hidden Markov model only in the addition of a transition matrix, A_n, for each sequence, where the index n indicates a sequence in the training set. Under the standard HMM, a single transition matrix, A, would be used for all sequences. To regularize the model, we further augment the basic HMM by placing Dirichlet priors on a, each row of A, and each row of B. The prior parameters are the uniform Dirichlet parameters γ, α, and β for a, A, and B respectively. The probability of the model with priors is shown below, where the prior probabilities are the first three terms in the product and take the form Dir(x; a, K) = Γ(Ka)/Γ(a)^K ∏_{i=1}^{K} x_i^{a-1}:

(2)  p(x, z, a, A, B | α, β, γ) = ( Γ(Kγ)/Γ(γ)^K ∏_i a_i^{γ-1} ) ( ∏_{n,i} Γ(Kα)/Γ(α)^K ∏_j A_{nij}^{α-1} ) ( ∏_i Γ(Mβ)/Γ(β)^M ∏_m B_{im}^{β-1} ) ∏_{n=1}^{N} a_{z_{n1}} B_{z_{n1},x_{n1}} ∏_{t=2}^{T_n} A_{n,z_{n,t-1},z_{nt}} B_{z_{nt},x_{nt}}

One potential difficulty that could be expected in classifying simple HMMs by transition matrix is that the probability of a sequence under an HMM does not change under a permutation of the hidden states. This problem is avoided when we force each sequence to share an emissions matrix, which locks the meaning of each transition matrix row to a particular emission distribution. If the emission matrix were not shared, then two HMMs with permuted hidden states could have transition matrices with a large Euclidean distance between them. For instance, two HMMs whose parameters are related by a permutation P of the hidden states, with A_2 = P A_1 P^T and B_2 = P B_1, have different transition matrices, but the probability of an observed sequence is the same under each. However, the Euclidean distance between their two transition matrices, A_1 and A_2, can be large.
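To make the generative process concrete, the following minimal sketch (our illustration, not code from the paper; Python with NumPy assumed) evaluates the log of Equation (1) for a single sequence, given its hidden state path:

import numpy as np

def sequence_log_prob(x, z, a, A_n, B):
    # x   : observed symbols, shape (T,), values in 0..M-1
    # z   : hidden states,    shape (T,), values in 0..K-1
    # a   : shared start state probabilities, shape (K,)
    # A_n : this sequence's own transition matrix, shape (K, K)
    # B   : shared emission matrix, shape (K, M)
    logp = np.log(a[z[0]]) + np.log(B[z[0], x[0]])
    for t in range(1, len(x)):
        logp += np.log(A_n[z[t - 1], z[t]]) + np.log(B[z[t], x[t]])
    return logp

The only departure from a standard HMM is that A_n is indexed by the sequence; a and B are common to the whole dataset.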
3 Background

3.1 Mixtures of HMMs

Smyth introduces a mixture of HMMs in [Smyth, 1997] and presents an initialization technique that is similar to our model in that an individual HMM is learned for each sequence, but differs from our model in that the emission matrices are not shared between HMMs. In [Smyth, 1997], these initial N models are used to compute the set of all pairwise distances between sequences, defined as the symmetrized log likelihood of each element of the pair under the other's respective model. Clusters are then computed from this distance matrix, which are used to initialize a set of K < N HMMs where each sequence is associated with one of K labels. Smyth notes that while the log probability of a sequence under an HMM is an intuitive distance measure between sequences, it is not intuitive how the parameters of the model are meaningful in terms of defining a distance between sequences. In this research, we demonstrate experimentally that the transition matrix of our model is useful for sequence classification when combined with standard distance metrics and tools.

3.2 Topic Models

Simpler precursors of LDA [Blei et al., 2003] and pLSI [Hofmann, 1999], which represent an entire corpus of documents with a single topic distribution vector, are very similar to the basic Hidden Markov Model, which assigns a single transition matrix to the entire set of sequences that are being modeled. To extend the HMM to a pLSI analogue, all that is needed is to split the single transition matrix into a per-sequence transition matrix. To extend this model to an LDA analogue, we must go a step further and attach Dirichlet priors to the transition matrices, as in our model. Inference of the LDA model (Figure 1a) on a corpus of documents learns a matrix of document-topic probabilities.

A row of this matrix, sometimes described as a mixed-membership vector, can be viewed as a measurement of how a given document is composed from the set of topics. In our HMM variant (Figure 1b), a single transition matrix, A_n, can be thought of as the analogue to a document-topic matrix row and can be viewed as a measurement of how a sequence is composed of pairs of adjacent symbols. The LDA model also includes a topic-word matrix, which indicates the probability of a word given a topic assignment. This matrix has the same meaning as the emissions matrix, B, in the HMM variant.

[Figure 1: Plate diagrams of the (a) LDA model, expanded to show each word separately, and the (b) HMM variant. The model parameters in the LDA model are defined as follows: K - number of topics, φ_k - a vector of word probabilities given topic k, β - parameters of the Dirichlet prior of φ_k, θ_n - a vector of topic probabilities in document n, α - parameters of the Dirichlet prior of θ_n. A row of the matrix B in the HMM variant has exactly the same meaning as a topic-word vector, φ_k, in the LDA model.]

The Fisher kernel [Jaakkola and Haussler, 1999] and the Probability Product Kernel (PPK) [Jebara et al., 2004] are principled methods that allow probabilistic models to be incorporated into SVM kernels. The HMM variant is similar to these methods in that it uses latent information from a generative model as input to a discriminative classifier. It differs from these methods, however, both in which portions of the generative model are incorporated into the discriminative classifier and in the assumptions about how differences in generating distributions affect comparisons between training examples.

4 Learning the model parameters

4.1 Baum-Welch

A well-known method for learning HMM model parameters is the Baum-Welch algorithm. The Baum-Welch algorithm is an expectation maximization algorithm for the standard HMM model, and the basic algorithm is easily modified to learn the multiple transition matrices of our variant. The parameter updates shown below converge to a maximum a posteriori (MAP) estimate of p(z, a, A, B | x, γ, α, β) [Rabiner and Juang, 1986]:

(3)  a_i^(new) ∝ ∑_n f_{ni}(1) b_{ni}(1) + γ - 1

(4)  A_{nij}^(new) ∝ ∑_{t=2}^{T_n} f_{ni}(t-1) A_{nij} B_{j,x_{nt}} b_{nj}(t) + α - 1

(5)  B_{im}^(new) ∝ ∑_n ∑_{t: x_{nt}=m} f_{ni}(t) b_{ni}(t) + β - 1

where f and b are the forward and backward recursions defined below:

(6)  f_{ni}(t) = ∑_j f_{nj}(t-1) A_{nji} B_{i,x_{nt}} for t > 1;  f_{ni}(1) = a_i B_{i,x_{n1}}

(7)  b_{ni}(t) = ∑_j A_{nij} B_{j,x_{n,t+1}} b_{nj}(t+1) for t < T_n;  b_{ni}(T_n) = 1

The complexity of the Baum-Welch-like algorithm for our variant is identical to the complexity of Baum-Welch for the standard HMM. The update for A_{ij} in the original HMM involves summing over ∑_n T_n terms, while the update for a single A_{nij} is a sum over T_n terms, making the total number of terms over all the A_n's in our variant, ∑_n T_n, the same number as in the original algorithm.
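To make the recursions concrete, here is a minimal NumPy sketch (ours; scaling is omitted for clarity, so a practical implementation would normalize each f_n(t) or work in log space to avoid underflow) of Equations (6) and (7) for a single sequence:

import numpy as np

def forward_backward(x, a, A_n, B):
    # Forward and backward recursions (Equations 6 and 7) for one sequence.
    # x: observed symbols, shape (T,); a: (K,); A_n: (K, K); B: (K, M).
    T, K = len(x), len(a)
    f = np.zeros((T, K))
    b = np.zeros((T, K))
    f[0] = a * B[:, x[0]]                         # f_ni(1) = a_i B_{i,x_1}
    for t in range(1, T):
        f[t] = (f[t - 1] @ A_n) * B[:, x[t]]      # sum_j f_nj(t-1) A_nji B_{i,x_t}
    b[T - 1] = 1.0                                # b_ni(T_n) = 1
    for t in range(T - 2, -1, -1):
        b[t] = A_n @ (B[:, x[t + 1]] * b[t + 1])  # sum_j A_nij B_{j,x_{t+1}} b_nj(t+1)
    return f, b

The update in Equation (4) then accumulates f_{ni}(t-1) A_{nij} B_{j,x_t} b_{nj}(t) over t, adds α - 1, and renormalizes each row of A_n.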
4.2 Gibbs Sampling

Two Gibbs sampling schemes are commonly used to infer Hidden Markov Model parameters [Scott, 2002]. Unlike the Baum-Welch algorithm, which returns a MAP estimate of the parameters, these sampling schemes allow the expectation of the parameters to be computed over the posterior distribution p(z, a, A, B | x, γ, α, β). In the Direct Gibbs sampler (DG), hidden states and parameters are initially chosen at random, then new hidden states are sampled using the current set of parameters:

(8)  p(z_t^(new) = i | z_{t-1}, z_{t+1}) ∝ A_{z_{t-1},i} B_{i,x_t} A_{i,z_{t+1}}

In the Forward Backward sampler (FB), the initial settings and parameter updates are the same as in the DG scheme, but the hidden states are sampled in order from T_n down to 1 using values from the forward recursion. Specifically, each hidden state z_{nt} is sampled given z_{n,t+1} = j from a multinomial with parameters

(9)   p(z_{n,T_n}^(new) = i | x_{n,1:T_n}) ∝ f_{ni}(T_n)

(10)  p(z_{nt}^(new) = i | x_{n,1:T_n}, z_{n,t+1}^(new) = j) = p(z_{nt}^(new) = i | x_{n,1:t}, z_{n,t+1}^(new) = j) ∝ f_{ni}(t) A_{nij},  t < T_n

In both algorithms, after the hidden states are sampled, parameters are sampled from Dirichlet conditional distributions, shown for A below, where I(ω) = 1 if ω is true and 0 otherwise:

(11)  p(A_{ni·} | z_n, α) = Dir( ∑_{t=2}^{T_n} I(z_{n,t-1} = i) I(z_{nt} = j) + α )

The FB sampler has been shown to mix more quickly than the DG sampler, especially in cases where adjacent hidden states are highly correlated [Scott, 2002]. We therefore use the FB sampler in our implementation.
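A minimal sketch of one FB sampling sweep for a single sequence follows (ours; it assumes the imports and the forward_backward routine from the previous sketch, plus a NumPy generator such as rng = np.random.default_rng()), sampling states backward from T_n per Equations (9) and (10):

def sample_states_fb(x, a, A_n, B, rng):
    # Draw one sample of the hidden state path z_n (Equations 9 and 10).
    f, _ = forward_backward(x, a, A_n, B)   # only the forward pass is needed
    T, K = f.shape
    z = np.zeros(T, dtype=int)
    w = f[T - 1] / f[T - 1].sum()           # p(z_T = i | x_{1:T}) is proportional to f_i(T)
    z[T - 1] = rng.choice(K, p=w)
    for t in range(T - 2, -1, -1):
        w = f[t] * A_n[:, z[t + 1]]         # p(z_t = i | ...) is proportional to f_i(t) A_{n,i,z_{t+1}}
        z[t] = rng.choice(K, p=w / w.sum())
    return z

After a sweep over all sequences, each row of A_n is resampled from its Dirichlet conditional (Equation 11), e.g. rng.dirichlet(counts + alpha).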

4.3 Variational Algorithm

Another approach for inference of the HMM variant parameters is through variational techniques. We employ a mean field variational algorithm that follows a pattern similar to EM. When the variational update steps are run until convergence, the Kullback-Leibler divergence between the variational distribution, q(z, a, A, B), and the model's conditional probability distribution, p(z, a, A, B | x, γ, α, β), is minimized. The transition matrices returned by the variational algorithm are the expectations of those matrices under the variational distribution. Thus, like the Gibbs sampling algorithm, the parameters returned by the variational algorithm approximate the expectations of the parameters under the conditional distribution. Our mean field variational approximation is shown below:

(12)  q(z, a, A, B) = q(a) ∏_{n=1}^{N} ∏_{i=1}^{K} q(A_{ni·}) ∏_{i=1}^{K} q(B_{i·}) ∏_{n,t} q(z_{nt})
      = ( Γ(∑_i γ_i)/∏_i Γ(γ_i) ∏_i a_i^{γ_i - 1} ) ( ∏_{n,i} Γ(∑_j α_{nij})/∏_j Γ(α_{nij}) ∏_j A_{nij}^{α_{nij} - 1} ) ( ∏_i Γ(∑_m β_{im})/∏_m Γ(β_{im}) ∏_m B_{im}^{β_{im} - 1} ) ∏_{n,t} h_{nt,z_{nt}}

with variational parameters h_{nti}, which approximate each z_{nt}, and α_{nij}, β_{im}, and γ_i, which can be thought of as Dirichlet parameters approximating α, β, and γ. When we maximize the variational free energy with respect to the variational parameters, we obtain the following update equations, where Ψ(x) = d log Γ(x)/dx:

(13)  α_{nij} = ∑_t h_{n,t-1,i} h_{n,t,j} + α

(14)  β_{im} = ∑_n ∑_{t: x_{nt}=m} h_{n,t,i} + β

(15)  γ_i = ∑_n h_{n,1,i} + γ

(16)  h_{n,t,i} ∝ exp( ∑_j h_{n,t-1,j} ( Ψ(α_{nji}) - Ψ(∑_{j'} α_{nj,j'}) ) + ∑_j h_{n,t+1,j} ( Ψ(α_{nij}) - Ψ(∑_{j'} α_{ni,j'}) ) + Ψ(β_{i,x_{nt}}) - Ψ(∑_m β_{im}) )

Notice that the update for h_{nt} depends only on the adjacent h's, h_{n,t-1} and h_{n,t+1}, as well as the expectations of the transition probabilities from the adjacent h's and the expectation of the emission probabilities from the current h_{nt}. This mean field algorithm can therefore be understood as an equivalent of the Direct Gibbs sampling method except that at subsequent time steps interactions occur between variational parameters rather than through the sampled values of z. A complete derivation of the variational algorithm is included on the authors' website.
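For concreteness, a sketch (ours) of the h update of Equation (16) at one position t, using scipy.special.digamma for Ψ; here alpha_n stands for the K x K matrix of variational transition parameters for sequence n and beta_v for the K x M emission counterpart:

import numpy as np
from scipy.special import digamma

def update_h(h_prev, h_next, alpha_n, beta_v, x_t):
    # One mean field update of h_nt (Equation 16).
    # h_prev, h_next: (K,); alpha_n: (K, K); beta_v: (K, M); x_t: observed symbol index.
    E_log_A = digamma(alpha_n) - digamma(alpha_n.sum(axis=1, keepdims=True))
    E_log_B = digamma(beta_v) - digamma(beta_v.sum(axis=1, keepdims=True))
    log_h = h_prev @ E_log_A + E_log_A @ h_next + E_log_B[:, x_t]
    h = np.exp(log_h - log_h.max())    # exponentiate stably
    return h / h.sum()                 # normalize to a distribution over the K states

The two matrix products correspond to the two transition terms in Equation (16): h_prev @ E_log_A sums the expected log transition probabilities into each state, while E_log_A @ h_next sums those leaving it.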
[Table 2 (values omitted): AUC results from all of the multi-class SVM experiments. Rows: Class categories (SCOP 1.67, 25%); Fold categories (SCOP 1.67, 25%); Fold categories (SCOP 1.67, 40%); Superfamily categories (SCOP 1.67, 40%). Columns: Baum-Welch, Gibbs Sampling, Variational. The best performing algorithm, the best performing setting of K, and the best combination of K and algorithm are marked in bold. The Gibbs-Sampling-derived representation most frequently returned the best AUC score on the majority of the datasets.]

5 Experimental Setup

5.1 Protocol

To evaluate our fixed-length representation scheme, for each dataset (described in Section 5.2), we created three sets of fixed-length representations per trial over ten trials by running each of the three inference algorithms, (i) Baum-Welch, (ii) Gibbs Sampling, and (iii) the mean field variational algorithm, on the entire set of input data. We varied the number of hidden states, K, from 5 to 20 in increments of 5. This procedure created a total of 120 (3 x 10 x 4) fixed-length representations for each dataset. The fixed-length vector data was then used as input to a support vector machine (SVM) classifier.² We used the SVM to perform either multiway classification on the dataset under the Crammer-Singer [Crammer and Singer, 2002] construction or the one-versus-rest approach, where a binary classifier was trained for each of the classes. We compare classification results from our model with results from the Spectrum(2) kernel for all experiments. The Spectrum(l) kernel is a string kernel whose vector representation is the set of counts of substrings of observed symbols of length l in a given string [Leslie et al., 2002]. For the one-versus-rest experiments, we compare our results to more biologically sensitive kernels for protein classification, described in [Rangwala and Karypis, 2005].

² We used SVM-light and SVM-struct for classification (svmlight.joachims.org) [Joachims, 1999].
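For reference, a small sketch (ours) of the Spectrum(l) representation used as a baseline: a sequence over an alphabet of size M is mapped to the M^l vector of substring counts.

from collections import Counter
from itertools import product

def spectrum_features(seq, alphabet, l=2):
    # Spectrum(l) feature vector: counts of every length-l substring [Leslie et al., 2002].
    counts = Counter(seq[i:i + l] for i in range(len(seq) - l + 1))
    kmers = ("".join(p) for p in product(sorted(alphabet), repeat=l))  # all M^l k-mers
    return [counts[k] for k in kmers]

# Toy example with a 3-letter alphabet standing in for the 20 amino acids:
print(spectrum_features("ABCAB", "ABC", l=2))  # [0, 2, 0, 0, 0, 1, 1, 0, 0]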

5.2 Protein Datasets

The Structural Classification of Proteins (SCOP) [Murzin et al., 1995] database categorizes proteins into a multi-level hierarchy that captures commonalities between protein structure at different levels of detail. To evaluate our representation, we ran sets of protein classification experiments on the three top levels of the SCOP taxonomy: class, fold, and superfamily. Our datasets, which were obtained from previous studies [Rangwala and Karypis, 2006; Kuang et al., 2004], were derived from either the SCOP 1.67 or the SCOP 1.53 versions and filtered at 25% and 40% pairwise sequence identities. A protein sequence dataset filtered at 25% identity will have no two sequences with more than 25% sequence identity. We partitioned the data into a single test and training set for each category. At the class level, the original dataset was split randomly into training and test sets. To eliminate high levels of similarity between sequences that could lead to trivially good classification results, we imposed constraints on the training/test set partitioning for classification in the fold and superfamily experiments. For the fold level classification problem, the training sets were partitioned so that no examples that shared the fold and superfamily labels were included in both the training and test sets. Similarly, for the superfamily level classification problem (referred to as the remote homology detection problem [Leslie et al., 2002; Rangwala and Karypis, 2005]), no examples that shared the superfamily and family levels were included in both the training and test sets.

5.3 Evaluation Metrics

We evaluated each classification experiment by computing the area under the ROC curve (AUC), a plot of the true positive rate against the false positive rate, constructed by adjusting the SVM's intercept parameter. We also computed the AUC50 value, which is a normalized computation of the area under the ROC curve until the first 50 false positives have been detected. We were worried about variance over different Baum-Welch runs due to convergence of the algorithm to different local optima. To mitigate this concern, we ran both the Baum-Welch algorithm and the other inference algorithms, for consistency, 10 separate times on each dataset. The results presented for each inference method are averages over individual results of the 10 trials across the different classes.
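As a point of reference, a sketch (ours) of the AUC50 computation described above: predictions are ranked by score, the ROC step curve is traced until the first 50 false positives, and the truncated area is normalized so that a perfect ranking scores 1.0 (this simple form assumes the test set contains at least 50 negatives):

def auc50(scores, labels, max_fp=50):
    # labels are 1 (positive) / 0 (negative); scores are classifier outputs.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    tp = fp = area = 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
            area += tp          # one unit-width column of height tp per false positive
            if fp == max_fp:
                break
    return area / (max_fp * n_pos) if n_pos else 0.0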
6 Results and Discussion

6.1 Protein Sequence Classification

Table 2 shows a comparison of results (average AUC scores) across the inference algorithms in three taxonomic categories (class, fold, and superfamily) using the multiclass SVM. Although the AUC scores are close for each algorithm, in most cases the Gibbs sampling algorithm outperforms the other algorithms. Table 3 shows a comparison of results over the inference algorithms but only for the one-versus-rest superfamily classification experiment on the SCOP 1.53 dataset. Similar to the multiclass experiments using the linear kernel, the Gibbs sampling algorithm outperforms the other inference methods in the one-versus-rest experiments.

[Table 3 (values omitted): AUC and AUC50 results for protein superfamily classification on the SCOP 1.53 dataset with 25% Astral filtering over a selected set of 23 superfamilies, using Gaussian and linear kernels in one-versus-rest SVM classification. Rows: Baum-Welch, Gibbs Sampling, Variational; columns: AUC and AUC50 under each kernel.]

Although the values of the best performing algorithm's AUC and AUC50 scores do not significantly change from the linear to the Gaussian kernel, the variational algorithm shows a large improvement, ranging from 6% to 30%.

6.2 Analysis of inference algorithms

The differences in AUC values resulting from the different training algorithms (Tables 2 and 3) can be explained, at least in part, by a high-level overview of how each algorithm operates. While the Baum-Welch algorithm returns MAP parameters of the model, both the Gibbs sampling method and the variational algorithm return expectations of the parameters under an approximation of the posterior distribution. The MAP solution from the Baum-Welch algorithm is likely to reach a local maximum of the posterior, while the other algorithms should tend to average over posterior parameters. The Gibbs sampling algorithm and the variational algorithm each compute expectations of the parameters under an approximate posterior distribution, but each uses a different method to construct this approximation. The variational algorithm will be less likely to converge to a good approximation of the marginal distribution because the mean field variational approximation necessarily does away with the direct coupling between adjacent hidden states characteristic of the HMM.

6.3 Comparative Performance

Tables 4 and 5 show a comparison between the HMM variant and common classification methods for the multiclass and one-versus-rest experiments respectively.

[Table 4 (values omitted): A comparison of results between the Spectrum kernel and the HMM variant under experiments using the multiclass SVM formulation. Rows: Class, Fold (25 categories), Fold (27 categories), Superfamily; columns: HMM Variant, Spectrum. The HMM variant scores are the best performing from Table 2.]

[Table 5 (values omitted): A selection of AUC and AUC50 scores for the remote homology detection problem using a variety of SVM kernels on the SCOP 1.53, 25% dataset with one-versus-rest classification. Rows: HMM Variant (best), Spectrum(2) [Leslie et al., 2002], Mismatch(5,1) [Leslie et al., 2003], Fisher [Jaakkola et al., 2000], SW-PSSM [Rangwala and Karypis, 2005]. The HMM variant scores are the best performing from Table 3.]

The AUC and AUC50 scores indicate that our scheme produces a representation that is roughly equivalent in power to the Spectrum kernel for protein classification. In defense of the HMM variant, the size of the vector representation produced by the Spectrum kernel is significantly larger than the typical representations produced by our HMM variant. The Mismatch(5,1) kernel, used for SCOP 1.53 superfamily classification (Table 5), is similar to the Spectrum(5) kernel but also counts substrings of length 5 that differ by one amino acid residue from those found in an observed sequence. The size of the vector representation associated with this kernel is large compared to the largest vector representation in our experiments, which is 400 for the HMM variant with 20 hidden states.

Nearly all of these high-performing kernel methods, unlike the HMM variant, employ domain-specific knowledge, such as carefully tuned position-specific scoring matrices, to aid classification. In contrast, the only parameter that needs to be adjusted in the HMM variant is the number of hidden states.

7 Conclusions and Future Work

Our HMM variant is an extension of the standard HMM that assigns individual transition matrices to each sequence in a dataset but keeps a single emissions matrix for the entire dataset. We describe three inference algorithms, two of which, a Baum-Welch-like algorithm and a Gibbs sampling algorithm, are similar to standard methods used to infer HMM parameters. A third, the variational inference algorithm, is related to algorithms used for inference on topic models and more complex HMM extensions. We demonstrate, by comparing results on protein sequence classification using our method in conjunction with SVMs, that each of these algorithms infers transition matrices that capture useful characteristics of individual sequences. Because our model fits within a large existing body of work on generative models, we are especially interested in related models that perform classification directly.

References

[Blei et al., 2003] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 2003.

[Crammer and Singer, 2002] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2, 2002.

[Eddy, 1998] S. Eddy. Profile hidden Markov models. Bioinformatics, 14(9), 1998.

[Hofmann, 1999] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1999.

[Jaakkola and Haussler, 1999] T.S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, 1999.

[Jaakkola et al., 2000] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1-2):95-114, 2000.

[Jebara et al., 2004] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. The Journal of Machine Learning Research, 5, 2004.

[Joachims, 1999] T. Joachims. SVM-light: Support Vector Machine. svmlight.joachims.org, University of Dortmund, 1999.

[Kuang et al., 2004] R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. In Computational Systems Bioinformatics, 2004.

[Leslie et al., 2002] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002.

[Leslie et al., 2003] C. Leslie, E. Eskin, W. S. Noble, and J. Weston. Mismatch string kernels for SVM protein classification.
In Advances in Neural Information Processing Systems, 2003.

[Murzin et al., 1995] A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247(4), 1995.

[Rabiner and Juang, 1986] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986.

[Rangwala and Karypis, 2005] H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239, 2005.

[Rangwala and Karypis, 2006] Huzefa Rangwala and George Karypis. Building multiclass classifiers for remote homology detection and fold recognition. BMC Bioinformatics, 7:455, 2006.

[Scott, 2002] S.L. Scott. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97(457), 2002.

[Smyth, 1997] P. Smyth. Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems, 1997.
