Biological Sequence Mining Using Plausible Neural Network and its Application to Exon/intron Boundaries Prediction


Kuochen Li, Dar-jen Chang, and Eric Rouchka, CECS, University of Louisville, Louisville, KY 40292, USA
Yuan Yan Chen, PNN Technologies Inc., PO Box 7051, Woodbridge, VA 22195, USA

Abstract — Biological sequences usually contain yet-to-be-discovered knowledge, and mining biological sequences usually involves a huge dataset and long computation time. Common tasks for biological sequence mining are pattern discovery, classification, and clustering. The newly developed model, the Plausible Neural Network (PNN), provides an intuitive and unified architecture for such large-dataset analysis. This paper introduces the basic concepts of the PNN and explains how it is applied to biological sequence mining. A specific biological sequence mining task, exon/intron boundary prediction, is implemented using a PNN. The experimental results show the capability of solving biological sequence mining tasks with PNNs.

I. INTRODUCTION

Research in computational biology has grown rapidly in the past ten years due to advances in molecular biology techniques yielding high-throughput data. One of the most important research interests in computational biology is sequence mining. It is much harder to extract the knowledge hidden in sequences than to generate biological sequences: with biological mutation and evolution, sequence datasets are usually enormous and complex, so the analysis model becomes a critical factor in biological sequence mining. Several tasks related to sequence mining, such as pattern discovery, classification, prediction, and clustering, can be implemented with statistical, neural network, or data mining models [1][2]. Those models capture knowledge or patterns in order to predict, classify, or analyze sequence data. A newly developed neural network model, the plausible neural network (PNN), which combines probabilistic and possibilistic inference for binary random variables, was introduced in [3] and subsequently patented by Chen in 2003 [4]. The PNN is similar to Hopfield neural networks in that both are bidirectional and symmetrical.
In its inference interpretation, however, the PNN is similar to Bayesian neural networks: the synaptic weights in both have probabilistic meanings and are therefore transparent. A PNN with weights based on mutual information content, as described in Section II, provides two further features. First, mutual-information-content weights put the PNN architecture of winner-take-all activation of competing neurons on a solid statistical inference foundation, in which no prior distribution is required as in Bayesian neural networks. Second, mutual-information-content weights can be used to compute the Shannon mutual information between attributes at little added cost. Several advantages of the PNN motivate applying it to biological sequence mining. The PNN model is very intuitive, in that the user can easily train it with different initial states to discover patterns hidden in large-scale, complex data. In addition, the model is so general that it can be applied to clustering, prediction, classification, and pattern discovery with the same architecture. Moreover, a PNN represents missing data as a null vector, so that missing values do not participate in the PNN inference process; most other clustering and classification methods handle missing data by adding artificial data or removing the incomplete records. Since the learning process and activation method in a PNN are simple, the computation for training or activation can be finished in a short period of time. This paper applies the PNN to one of the biological sequence mining tasks, the prediction of exon/intron boundaries. The accuracy rate, compared with other computational intelligence models, demonstrates the quality of the PNN results.

II. A PNN MODEL FOR BIOLOGICAL SEQUENCE MINING

A. PNN Architecture

A basic PNN architecture for biological sequence mining consists of two layers (an input layer and a hidden layer) of cooperative and competitive neurons. The connection between input neurons and hidden neurons is complete, bidirectional, and symmetric. The competitive neurons are grouped into a winner-take-all (WTA) ensemble and, in turn, WTA ensembles cooperate in decision making.
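The input layer just described is assembled from WTA ensembles. As a minimal sketch, here is how one data row might be encoded under the coding scheme detailed below in Section II-B: a three-neuron class ensemble plus a four-neuron ensemble per base, with standard IUPAC ambiguity codes splitting their activation evenly and missing values left as the null vector. All names and the sample row are illustrative, not taken from the paper's implementation.

```python
import numpy as np

# Illustrative encoding tables (names are mine, not the paper's).
CLASS_CODES = {"IE": 0, "EI": 1, "N": 2}      # three-neuron class ensemble
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
# Standard IUPAC ambiguity codes spread activation evenly over the
# possible bases, so each ensemble still sums to 1 (condition (1)).
IUPAC = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT",
         "K": "GT", "M": "AC", "N": "ACGT"}

def encode_base(b):
    """Encode one DNA base as a four-neuron WTA ensemble (A, C, G, T)."""
    v = np.zeros(4)
    if b in BASE_CODES:
        v[BASE_CODES[b]] = 1.0
    elif b in IUPAC:                          # uncertain base: split evenly
        for c in IUPAC[b]:
            v[BASE_CODES[c]] = 1.0 / len(IUPAC[b])
    # an unrecognized/missing base stays the null vector
    return v

def encode_row(seq, label=None):
    """Concatenate the class ensemble and one ensemble per base."""
    cls = np.zeros(3)                         # null vector if label missing
    if label is not None:
        cls[CLASS_CODES[label]] = 1.0         # e.g. (0, 1, 0) means EI
    return np.concatenate([cls] + [encode_base(b) for b in seq])

row = encode_row("AC" + "G" * 58, label="EI")
assert row.size == 3 + 4 * 60                 # 243 input neurons, as in III-C
```

A 60-base sequence thus becomes a 243-component input vector, matching the network size reported in the experiments.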
Fig. 1 illustrates the architecture for DNA sequence data. The hidden-neuron WTA ensemble represents unlabeled (thus unknown, or hidden) patterns of similar sequences. Each input WTA ensemble encodes the values of an input attribute, which is either a sequence base or the sequence class. For example, in Fig. 1 the input WTA ensemble (IE, EI, N) encodes the values of the attribute Class (i.e., the DNA sequence class label) and each WTA ensemble (A, C, G, T) encodes one DNA base in the sequence.

Fig. 1. PNN architecture for predicting exon/intron boundaries

B. Attribute Value Coding

A PNN encodes categorical or continuous attributes into WTA ensembles. For sequence mining, only categorical attribute encoding is required. A categorical attribute with k categories is encoded as a WTA ensemble of k binary-valued (i.e., 0 or 1) neurons (X1, X2, …, Xk). For example, splice-junction gene sequence datasets may have a class attribute with three possible values, IE, EI, and N, representing an intron-to-exon boundary, an exon-to-intron boundary, and neither, respectively. The class attribute is then encoded in a three-neuron WTA ensemble (IE, EI, N): the input vector (1, 0, 0) indicates the sequence is of class IE, (0, 1, 0) indicates EI, and (0, 0, 1) indicates N. Each base in a DNA sequence is encoded by a four-neuron WTA ensemble (A, C, G, T). This DNA encoding also allows for uncertain base values given in the IUPAC code; for example, if the base value is R (meaning A or G), the input vector for the base is (0.5, 0, 0.5, 0).

Given this coding scheme, an input WTA ensemble (X1, X2, …, Xk) representing an attribute value satisfies

X1 + X2 + … + Xk = 1 and 0 ≤ Xi ≤ 1 for all i   (1)

Furthermore, missing attribute values are represented by a null vector, i.e., all Xi in the attribute ensemble are zero. With this coding, since the input from every neuron in the attribute ensemble is zero, the activation potential (input times weight) contributed by the attribute is zero. As a consequence, an attribute with a missing value does not participate in the WTA activation of the competing hidden neurons.

C. Connection Weights

Assume each neuron in the PNN is a binary random variable. The connection weight between an input neuron X and a hidden neuron Y is given by

ω = ln [ P(X = 1, Y = 1) / ( P(X = 1) P(Y = 1) ) ]   (2)

where P denotes the probability of the enclosed event. The weight (2) captures the firing history, or mutual information content, of the two connected neurons: X and Y are positively associated (excitatory), statistically independent, or negatively associated (inhibitory) according to whether ω is positive, zero, or negative. Note that binary random variables are assumed in defining (2), but the estimation of (2) described in subsection F, and the PNN applications, also apply to neurons satisfying (1).

D. Forward and Reverse Firing

The PNN is bidirectional, since firing can be carried out in both directions between input neurons and hidden neurons. For convenience, forward firing refers to the activation of hidden neurons given the states of the input neurons, and reverse firing refers to the activation of input neurons given the states of the hidden neurons. Suppose an ensemble of competitive WTA hidden neurons Y1, Y2, …, Yn receives input signals from the input neurons X1, X2, …, Xm. The firing states of Y1, Y2, …, Yn are computed in the following steps:

1. Compute the activation potential Zj, j = 1, 2, …, n:
   Zj = exp( Σi ωij Xi )   (3)
2. Compute Max, the maximum of Zj, j = 1, 2, …, n.
3. Perform an α-cut (0 < α ≤ 1) on the Zj, i.e., set Zj to zero if Zj < α · Max.
4. Finally, calculate the activation level of Yj, 1 ≤ j ≤ n, by
   Yj = Zj / Σk Zk   (4)

Step 3 is a soft WTA competition, since there can be multiple winners with different activation levels. Alternatively, a single winner with the largest activation level can be chosen; the latter is called hard WTA. Step 4 is a normalization step that makes the Yj sum to one. In reverse firing, the states of the hidden neurons Y1, Y2, …, Yn are given and the computation of the activation of

input neurons is carried out for each input WTA ensemble separately, using steps 1-4 above with some modifications. Specifically, for each input WTA ensemble (X1, X2, …, Xk), the activation of its neurons is computed with these changes:

a. Zi = exp( Σj ϖji Yj ), i = 1, 2, …, k, where ϖji = ωij (this equality is why the PNN is said to be symmetric).
b. In the normalization step 4, Xi = Zi / Σl Zl, where the sum runs over the ensemble.
c. The α value chosen for this activation is usually much smaller than the one used for forward firing.

E. The Principle of Inverse Inference

The PNN processes input data rows via the WTA competition among the hidden neurons. Assuming hard WTA competition is used, one hidden neuron becomes the winner for each input data row. The outcome of the WTA competition is judged by the activation level of each hidden neuron under forward firing of the data row, as described in subsection D: the hidden neuron with the highest activation level wins (i.e., the data row belongs to the cluster represented by that hidden neuron). This competition process can be justified by the principle of inverse inference, stated as: "Given the evidence, the more probably a hypothesis can produce such evidence, the more likely it is to be true" [3]. To paraphrase the principle in PNN terms: given an input data row (i.e., given the input neuron values), the more probably a hidden neuron responds to such an input, the more likely it is that the data row belongs to the cluster represented by that hidden neuron. If x1, x2, …, xm denote the states (0 or 1) of the input neurons in a PNN, the evidence (or possibility) of a hidden neuron Yj responding to the input can be measured by the conditional probability

P(x1, x2, …, xm | Yj = 1)   (5)

Therefore, to justify the principle of inverse inference as applied to the PNN, it must be shown that the winning hidden neuron Yj judged by forward firing with the weights (2) indeed has the highest value of (5).

Using (2) to compute the activation potential in (3) yields

Σi ωij xi = Σi xi [ ln P(Xi = 1 | Yj = 1) − ln P(Xi = 1) ]   (6)

If the input attributes are statistically independent, (6) reduces to

Σi ωij xi = ln [ P(x1, …, xm | Yj = 1) / P(x1, …, xm) ]   (7)

Plugging (7) into (3), the hidden neuron Yj's pre-WTA activation value Zj becomes

Zj = P(x1, …, xm | Yj = 1) / P(x1, …, xm)   (8)

Comparing (8) with (5), Zj equals the conditional probability (5) divided by a factor common to all hidden neurons. As a result, using (8) to determine the winning hidden neuron yields the same result as using (5). This concludes the justification of the principle of inverse inference used in the PNN. Furthermore, since the normalized activations sum to one, the normalization step in effect computes the posterior probability of Yj given the input as the activation level of the hidden neuron Yj.

The statistical independence assumption on the input attributes is a limitation of the PNN, as it is in naïve Bayesian neural networks. However, in data exploration without any prior knowledge of the data, this limitation is not a major problem in PNN applications. One important note about the PNN is that no prior distribution is needed, due to the uniform prior implicit in the denominator of (8). This is a huge advantage over Bayesian neural networks, which require the difficult task of estimating a prior distribution.

F. PNN Training

To apply a PNN, a training (learning) procedure is needed to estimate the weights (2) from a given dataset. Given the past co-firing history of two neurons X and Y, (xt, yt), t = 1, 2, …, n, the weight (2) between X and Y can be estimated by

ω = ln [ n Σt xt yt / ( (Σt xt)(Σt yt) ) ]   (9)

Suppose a training set of DNA sequences is given. First, fire the hidden neurons randomly for each sequence to produce an initial firing matrix. Based on the initial firing matrix and the training sequences, calculate new weights between input neurons and hidden neurons using (9). In turn, use the new weights in forward firing of the training sequences to get a new firing matrix. Repeat this process until the PNN is stable, i.e., until the firing matrix no longer changes. Fig. 2 shows the PNN training algorithm.
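Steps 1-4 of the forward firing whose winner the principle above justifies can be sketched as follows. This is a rough illustration, not the paper's code: the weight matrix W is assumed to hold the mutual-information-content weights of equation (2), and the function name and toy numbers are mine.

```python
import numpy as np

def forward_fire(x, W, alpha=0.8, hard=False):
    """Fire a hidden WTA ensemble given input vector x and weights W (m x n)."""
    z = np.exp(x @ W)                  # step 1: activation potentials Z_j
    z[z < alpha * z.max()] = 0.0       # steps 2-3: alpha-cut against the max
    if hard:                           # hard WTA: keep only the single winner
        winner = np.argmax(z)
        z = np.zeros_like(z)
        z[winner] = 1.0
    return z / z.sum()                 # step 4: normalize so sum(Y) == 1

# Toy example: 4 input neurons, 2 hidden neurons, made-up weights.
x = np.array([1.0, 0.0, 0.0, 1.0])
W = np.array([[ 0.7, -0.2],
              [-0.1,  0.4],
              [ 0.2,  0.1],
              [ 0.5, -0.3]])
y = forward_fire(x, W, alpha=0.8)      # soft WTA activation levels
```

Because the maximum always survives its own α-cut, the normalization in step 4 is well defined for any 0 < α ≤ 1.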
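The training loop of subsection F — random initial firing, re-estimating the weights with the empirical form of (2), re-firing, until the firing matrix stabilizes — might be sketched like this. It is an illustrative toy under my own assumptions (a small smoothing constant to avoid log(0) and a fixed iteration cap), not the paper's implementation.

```python
import numpy as np

def estimate_weights(X, Y, eps=1e-6):
    """Empirical mutual-information-content weights; eps smoothing is mine."""
    n = X.shape[0]
    co = X.T @ Y + eps                       # co-firing sums for every pair
    return np.log(n * co / np.outer(X.sum(0) + eps, Y.sum(0) + eps))

def train_pnn(X, n_hidden=8, alpha=0.8, max_iter=200, seed=0):
    """Alternate weight estimation and forward firing until stable."""
    rng = np.random.default_rng(seed)
    Y = rng.random((X.shape[0], n_hidden))
    Y /= Y.sum(1, keepdims=True)             # random initial firing matrix
    for _ in range(max_iter):
        W = estimate_weights(X, Y)
        Z = np.exp(X @ W)                    # forward-fire all rows at once
        Z[Z < alpha * Z.max(1, keepdims=True)] = 0.0
        Y_new = Z / Z.sum(1, keepdims=True)
        if np.allclose(Y_new, Y):            # stable: firing matrix unchanged
            break
        Y = Y_new
    return W, Y

# Toy data: two repeated input patterns, two hidden neurons.
X = np.array([[1., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.],
              [0., 1., 1., 0.]])
W, Y = train_pnn(X, n_hidden=2)
```

Each iteration is a matrix multiplication plus an element-wise pass, which is why training finishes quickly even on large sequence sets.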

Fig. 2. The PNN training algorithm

III. EXPERIMENTS

A. Datasets

Two datasets are used to test the performance of the PNN. The first dataset was downloaded from [7]. It contains 3190 sequences taken from GenBank. Each data sample contains a class attribute (I/E, E/I, or N), an instance name, and a 60-base sequence with 30 bases on either side of the boundary; 767 sequences are classified as E/I, 768 as I/E, and 1655 as N. In order to determine whether PNNs can perform splice site prediction on a large dataset, a set of sequences used to benchmark gene prediction programs was downloaded from [5]. This second dataset contains 9204 sequences extracted from 570 different genes. From this set, 2629 exon/intron (E/I) boundaries and 2634 intron/exon (I/E) boundaries were extracted, with 30 bases of sequence reported on either side of the boundary. In addition, 3941 base sequences not containing an E/I or I/E boundary (reported as N) were extracted. This dataset was then separated randomly into two parts: one was used to train a PNN with 8 hidden neurons to distinguish between these three classes, and the other was used to test the performance of the PNN for I/E and E/I recognition.

B. Comparison Methods

Eight different classification algorithms were tested on the first sequence dataset in [8]. The algorithms are as follows:

1. ID3: a decision tree algorithm
2. BP: the backpropagation algorithm
3. LVQ: a combination of LVQ procedures
4. FS+LVQ: LVQ learning with feature selection
5. 1-nn: a 1-nearest-neighbor neural network
6. FS1NN: a 1-nearest-neighbor neural network with prior feature selection
7. k-nn: a k-nearest-neighbor neural network
8. FSKNN: a k-nearest-neighbor neural network with prior feature selection

In order to compare the PNN with the above algorithms, the same dataset was used to test the PNN and its accuracy rate was evaluated. The tested PNN contains 8 hidden neurons and uses soft WTA competition. Table 1 shows the accuracy rates of the PNN and of the other algorithms as reported in [8]. The results show that the PNN achieved a higher accuracy rate than any of the algorithms listed above.

Table 1. Accuracy rates of E/I and I/E prediction using different algorithms (PNN, ID3, BP, LVQ, FS+LVQ, 1-nn, FS1NN, k-nn, FSKNN)

C. Evaluating the Performance of the PNN

The second dataset was employed to test the PNN's handling of a large dataset. Of the total 9204 sequences, 6204 randomly selected sequences were used to train the PNN, and the other 3000 sequences were used to test it. The PNN contains 8 hidden neurons and 243 input neurons (3 for the class attribute and 4 for each of the 60 bases). The α-cut was set to 0.8 in order to obtain more precise predictions. The PNN was trained for 200 iterations; training completed in minutes but had not yet met the stability condition. The 3000 testing sequences were then input to the trained PNN to test its performance: 2830 of the 3000 sequences were correctly classified, i.e., a 94.3% accuracy rate was achieved.

The performance parameters sensitivity (Sn), specificity (Sp), and correlation coefficient (CC) were calculated to assess the performance of the PNN. They are defined as follows:

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
CC = (TP·TN − FP·FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

where TP stands for True Positive, FP for False Positive, TN for True Negative, and FN for False Negative. Table 2 shows the performance parameters of this experiment. Notice that Sn, Sp, and CC are rather high for both the I/E and E/I predictions, even though the PNN had not yet reached an optimal state. The training was extended for a couple hundred more iterations, but the performance did not improve significantly. To improve training performance, the next experiment used a larger training dataset.
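The three measures can be computed directly from the confusion counts. The sketch below uses made-up counts purely for illustration; they are not the paper's results.

```python
import math

def sn_sp_cc(tp, fp, tn, fn):
    """Sensitivity, specificity, and correlation coefficient (CC)."""
    sn = tp / (tp + fn)                        # fraction of true sites found
    sp = tn / (tn + fp)                        # fraction of non-sites rejected
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, cc

# Made-up confusion counts, for illustration only.
sn, sp, cc = sn_sp_cc(tp=90, fp=10, tn=85, fn=15)
```

CC rewards a balance of sensitivity and specificity: it stays near 1 only when both false positives and false negatives are few.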

Table 2. Performance parameters of E/I and I/E predictions for 3000 testing sequences

In the next experiment, the training dataset was increased to 7204 randomly selected sequences, and the remaining 2000 sequences were used to test the performance of the PNN. The same PNN configuration as in the previous experiment was used. Training finished, surprisingly, in only 58 iterations, and 1883 of the 2000 testing sequences were correctly classified; the PNN thus achieved about a 94% accuracy rate in a very short training time. Table 3 shows the performance parameters of this experiment.

Table 3. Performance parameters of E/I and I/E predictions for 2000 testing sequences

The accuracy measured in this experiment is similar to that of the previous one. Additionally, for misclassifications the PNN provides useful information (e.g., 0.45: N, 0.55: E/I) that can be used for further analysis. Since PNN training is based on mutual information rather than error correction, a simple tuning algorithm can be applied to improve performance: it repeats the training process and records the PNN configuration with the best classification accuracy.

Table 4. Measurement of E/I and I/E prediction for 2000 testing sequences

Table 4 shows the measurement for the PNN with the tuning algorithm. Compared with the 117 misclassified testing sequences of the previous experiment, only 89 of the 2000 sequences were misclassified. Furthermore, the I/E prediction improved significantly in Sn and CC, as shown in Table 4.

IV. CONCLUSION

The PNN not only shows good performance for the prediction of exon/intron boundaries but can also discover the patterns of those boundaries. In a trained PNN, each hidden neuron represents a discovered pattern, which can easily be extracted by reverse-firing that particular hidden neuron. The extracted knowledge then provides a better understanding of the dataset. Thus, the number of hidden neurons is crucial to PNN performance. In this paper, different numbers of hidden neurons were tested in the experiments; the PNN with 8 hidden neurons showed the best performance with respect to accuracy rate and training time. However, deciding the minimum number of hidden neurons needed to achieve desired performance criteria is still an open issue.

V. FUTURE WORK

PNNs have shown the ability to solve problems such as classification, clustering, pattern recognition, and function approximation [9][10]. In this paper, the PNN has shown promise for gene splice-junction recognition, and its fast convergence makes it feasible for large-scale biological sequence analysis. Future work includes the discovery of new junction patterns and the improvement of the PNN for sequence mining in general. In addition, the possibility of applying the PNN to promoter prediction and protein profile pattern recognition will also be considered.

REFERENCES

[1] B. S. Everitt and G. Dunn, Applied Multivariate Data Analysis. London: Arnold, 2001, ch. 6.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001, ch. 7-8.
[3] Y. Y. Chen, "Plausible neural networks," in Advances in Neural Networks World, A. Grmela and N. E. Mastorakis, Eds. WSEAS Press, 2002.
[4] Y. Y. Chen, "Plausible neural network with supervised and unsupervised cluster analysis," U.S. Patent, July 24, 2003.
[5] M. Burset and R. Guigó, "Evaluation of gene structure prediction programs," Genomics, vol. 34, pp. 353-367, 1996.
[6] E. Y. Chen, M. Zollo, R. A. Mazzarella, A. Ciccodicola, C. N. Chen, L. Zuo, C. Heiner, F. W. Burough, M. Repetto, D. Schlessinger, and M. D'Urso, "Long-range sequence analysis in Xq28: thirteen known and six candidate genes in high-GC DNA between the RCP/GCP and G6PD loci," Hum. Mol. Genet., vol. 5, no. 5, 1996.
[7] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, UCI Repository of Machine Learning Databases, University of California, Department of Information and Computer Science, Irvine, CA, 1998.
[8] M. Mar Abad Grau and L. D. H. Molinero, "Feature selection in codebook based methods provides high accuracy," in Proc. IJCNN'99, vol. 3, 1999.
[9] K. Li, D. Chang, and Y. Y. Chen, "High-speed Bi-directional Function Approximation Using Plausible Neural Networks," in Proc. IJCNN'06, 2006.
[10] K. Li and D. Chang, "Fuzzy Membership Function Elicitation using Plausible Neural Network," in Proc. ICAI'06, vol. I, 2006.


More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES Aram AlSuer, Ahmed Al-An and Amr Atya 2 Faculty of Engneerng and Informaton Technology, Unversty of Technology, Sydney, Australa

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems Determnng Fuzzy Sets for Quanttatve Attrbutes n Data Mnng Problems ATTILA GYENESEI Turku Centre for Computer Scence (TUCS) Unversty of Turku, Department of Computer Scence Lemmnkäsenkatu 4A, FIN-5 Turku

More information

On Supporting Identification in a Hand-Based Biometric Framework

On Supporting Identification in a Hand-Based Biometric Framework On Supportng Identfcaton n a Hand-Based Bometrc Framework Pe-Fang Guo 1, Prabr Bhattacharya 2, and Nawwaf Kharma 1 1 Electrcal & Computer Engneerng, Concorda Unversty, 1455 de Masonneuve Blvd., Montreal,

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Error Detection and Impact-Sensitive Instance Ranking in Noisy Datasets

Error Detection and Impact-Sensitive Instance Ranking in Noisy Datasets Error Detecton and Impact-Senstve Instance Ranng n osy Datasets Xngquan Zhu, Xndong Wu, and Yng Yang Department of Computer Scence, Unversty of Vermont, Burlngton VT 05405, USA {xqzhu, xwu, yyang}@cs.uvm.edu

More information

Improving anti-spam filtering, based on Naive Bayesian and neural networks in multi-agent filters

Improving anti-spam filtering, based on Naive Bayesian and neural networks in multi-agent filters J. Appl. Envron. Bol. Sc., 5(7S)381-386, 2015 2015, TextRoad Publcaton ISSN: 2090-4274 Journal of Appled Envronmental and Bologcal Scences www.textroad.com Improvng ant-spam flterng, based on Nave Bayesan

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Extraction of Fuzzy Rules from Trained Neural Network Using Evolutionary Algorithm *

Extraction of Fuzzy Rules from Trained Neural Network Using Evolutionary Algorithm * Extracton of Fuzzy Rules from Traned Neural Network Usng Evolutonary Algorthm * Urszula Markowska-Kaczmar, Wojcech Trelak Wrocław Unversty of Technology, Poland kaczmar@c.pwr.wroc.pl, trelak@c.pwr.wroc.pl

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

Yan et al. / J Zhejiang Univ-Sci C (Comput & Electron) in press 1. Improving Naive Bayes classifier by dividing its decision regions *

Yan et al. / J Zhejiang Univ-Sci C (Comput & Electron) in press 1. Improving Naive Bayes classifier by dividing its decision regions * Yan et al. / J Zhejang Unv-Sc C (Comput & Electron) n press 1 Journal of Zhejang Unversty-SCIENCE C (Computers & Electroncs) ISSN 1869-1951 (Prnt); ISSN 1869-196X (Onlne) www.zju.edu.cn/jzus; www.sprngerlnk.com

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

Artificial Intelligence (AI) methods are concerned with. Artificial Intelligence Techniques for Steam Generator Modelling

Artificial Intelligence (AI) methods are concerned with. Artificial Intelligence Techniques for Steam Generator Modelling Artfcal Intellgence Technques for Steam Generator Modellng Sarah Wrght and Tshldz Marwala Abstract Ths paper nvestgates the use of dfferent Artfcal Intellgence methods to predct the values of several contnuous

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

A Similarity-Based Prognostics Approach for Remaining Useful Life Estimation of Engineered Systems

A Similarity-Based Prognostics Approach for Remaining Useful Life Estimation of Engineered Systems 2008 INTERNATIONAL CONFERENCE ON PROGNOSTICS AND HEALTH MANAGEMENT A Smlarty-Based Prognostcs Approach for Remanng Useful Lfe Estmaton of Engneered Systems Tany Wang, Janbo Yu, Davd Segel, and Jay Lee

More information

An Anti-Noise Text Categorization Method based on Support Vector Machines *

An Anti-Noise Text Categorization Method based on Support Vector Machines * An Ant-Nose Text ategorzaton Method based on Support Vector Machnes * hen Ln, Huang Je and Gong Zheng-Hu School of omputer Scence, Natonal Unversty of Defense Technology, hangsha, 410073, hna chenln@nudt.edu.cn,

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset

Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset Under-Samplng Approaches for Improvng Predcton of the Mnorty Class n an Imbalanced Dataset Show-Jane Yen and Yue-Sh Lee Department of Computer Scence and Informaton Engneerng, Mng Chuan Unversty 5 The-Mng

More information

Predicting the Cellular Localization Sites of Proteins Using Bayesian Networks and Bayesian Model Averaging

Predicting the Cellular Localization Sites of Proteins Using Bayesian Networks and Bayesian Model Averaging Predctng the Cellular Localzaton Stes of Protens Usng Bayesan Networs and Bayesan Model Averagng Yetan Chen Department of Computer Scence Iowa State Unversty yetanc@cs.astate.edu Abstract In ths wor, we

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

A Post Randomization Framework for Privacy-Preserving Bayesian. Network Parameter Learning

A Post Randomization Framework for Privacy-Preserving Bayesian. Network Parameter Learning A Post Randomzaton Framework for Prvacy-Preservng Bayesan Network Parameter Learnng JIANJIE MA K.SIVAKUMAR School Electrcal Engneerng and Computer Scence, Washngton State Unversty Pullman, WA. 9964-75

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

A Saturation Binary Neural Network for Crossbar Switching Problem

A Saturation Binary Neural Network for Crossbar Switching Problem A Saturaton Bnary Neural Network for Crossbar Swtchng Problem Cu Zhang 1, L-Qng Zhao 2, and Rong-Long Wang 2 1 Department of Autocontrol, Laonng Insttute of Scence and Technology, Benx, Chna bxlkyzhangcu@163.com

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Context-Specific Bayesian Clustering for Gene Expression Data

Context-Specific Bayesian Clustering for Gene Expression Data Context-Specfc Bayesan Clusterng for Gene Expresson Data Yoseph Barash School of Computer Scence & Engneerng Hebrew Unversty, Jerusalem, 91904, Israel hoan@cs.huj.ac.l Nr Fredman School of Computer Scence

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2860-2866 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A selectve ensemble classfcaton method on mcroarray

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

An efficient iterative source routing algorithm

An efficient iterative source routing algorithm An effcent teratve source routng algorthm Gang Cheng Ye Tan Nrwan Ansar Advanced Networng Lab Department of Electrcal Computer Engneerng New Jersey Insttute of Technology Newar NJ 7 {gc yt Ansar}@ntedu

More information

Experiments in Text Categorization Using Term Selection by Distance to Transition Point

Experiments in Text Categorization Using Term Selection by Distance to Transition Point Experments n Text Categorzaton Usng Term Selecton by Dstance to Transton Pont Edgar Moyotl-Hernández, Héctor Jménez-Salazar Facultad de Cencas de la Computacón, B. Unversdad Autónoma de Puebla, 14 Sur

More information

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap Int. Journal of Math. Analyss, Vol. 8, 4, no. 5, 7-7 HIKARI Ltd, www.m-hkar.com http://dx.do.org/.988/jma.4.494 Emprcal Dstrbutons of Parameter Estmates n Bnary Logstc Regresson Usng Bootstrap Anwar Ftranto*

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

KOHONEN'S SELF ORGANIZING NETWORKS WITH "CONSCIENCE"

KOHONEN'S SELF ORGANIZING NETWORKS WITH CONSCIENCE Kohonen's Self Organzng Maps and ther use n Interpretaton, Dr. M. Turhan (Tury) Taner, Rock Sold Images Page: 1 KOHONEN'S SELF ORGANIZING NETWORKS WITH "CONSCIENCE" By: Dr. M. Turhan (Tury) Taner, Rock

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

Using Neural Networks and Support Vector Machines in Data Mining

Using Neural Networks and Support Vector Machines in Data Mining Usng eural etworks and Support Vector Machnes n Data Mnng RICHARD A. WASIOWSKI Computer Scence Department Calforna State Unversty Domnguez Hlls Carson, CA 90747 USA Abstract: - Multvarate data analyss

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

Learning an Image Manifold for Retrieval

Learning an Image Manifold for Retrieval Learnng an Image Manfold for Retreval Xaofe He*, We-Yng Ma, and Hong-Jang Zhang Mcrosoft Research Asa Bejng, Chna, 100080 {wyma,hjzhang}@mcrosoft.com *Department of Computer Scence, The Unversty of Chcago

More information

A Background Subtraction for a Vision-based User Interface *

A Background Subtraction for a Vision-based User Interface * A Background Subtracton for a Vson-based User Interface * Dongpyo Hong and Woontack Woo KJIST U-VR Lab. {dhon wwoo}@kjst.ac.kr Abstract In ths paper, we propose a robust and effcent background subtracton

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 2 Sofa 2016 Prnt ISSN: 1311-9702; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-2016-0017 Hybrdzaton of Expectaton-Maxmzaton

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

C2 Training: June 8 9, Combining effect sizes across studies. Create a set of independent effect sizes. Introduction to meta-analysis

C2 Training: June 8 9, Combining effect sizes across studies. Create a set of independent effect sizes. Introduction to meta-analysis C2 Tranng: June 8 9, 2010 Introducton to meta-analyss The Campbell Collaboraton www.campbellcollaboraton.org Combnng effect szes across studes Compute effect szes wthn each study Create a set of ndependent

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont)

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont) Loop Transformatons for Parallelsm & Localty Prevously Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Loop nterchange Loop transformatons and transformaton frameworks

More information