Protein Secondary Structure Prediction Using Support Vector Machines, Neural Networks and Genetic Algorithms


Georgia State University
Computer Science Theses, Department of Computer Science

Protein Secondary Structure Prediction Using Support Vector Machines, Neural Networks and Genetic Algorithms

Anjum B. Reyaz-Ahmed

Recommended Citation:
Reyaz-Ahmed, Anjum B., "Protein Secondary Structure Prediction Using Support Vector Machines, Neural Networks and Genetic Algorithms." Thesis, Georgia State University, 2007.

This thesis is brought to you for free and open access by the Department of Computer Science at Georgia State University. It has been accepted for inclusion in Computer Science Theses by an authorized administrator of Georgia State University. For more information, please contact scholarworks@gsu.edu.

PROTEIN SECONDARY STRUCTURE PREDICTION USING SUPPORT VECTOR MACHINES, NEURAL NETWORKS AND GENETIC ALGORITHMS

by ANJUM B REYAZ-AHMED

Under the Direction of Yanqing Zhang

ABSTRACT

Bioinformatics techniques for protein secondary structure prediction mostly depend on the information available in the amino acid sequence. Support vector machines (SVM) have shown strong generalization ability in a number of application areas, including protein structure prediction. In this study, a new sliding window scheme is introduced, with multiple windows used to form the protein data for training and testing the SVM. An orthogonal encoding scheme coupled with the BLOSUM62 matrix is used to make the prediction. First, the prediction of binary classifiers using multiple windows is compared with the single window scheme; the results show that the single window is not better in all cases. Two new classifiers are then introduced for effective tertiary classification. These new classifiers use neural networks and genetic algorithms to optimize the accuracy of the tertiary classifier. The accuracy levels of the new architectures are determined and compared with other studies. The tertiary architecture is better than most available techniques.

INDEX WORDS: Binary classifier, BLOSUM62, encoding scheme, orthogonal profile, support vector machine (SVM), tertiary classifier.

PROTEIN SECONDARY STRUCTURE PREDICTION USING SUPPORT VECTOR MACHINES, NEURAL NETWORKS AND GENETIC ALGORITHMS

by ANJUM B REYAZ-AHMED

Under the Direction of Yanqing Zhang

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in the College of Arts and Sciences, Georgia State University, 2007

Copyright by Anjum B Reyaz-Ahmed 2007

PROTEIN SECONDARY STRUCTURE PREDICTION USING SUPPORT VECTOR MACHINES, NEURAL NETWORKS AND GENETIC ALGORITHMS

by ANJUM B REYAZ-AHMED

Major Professor: Yanqing Zhang
Committee: Saeid Belkasim, Yingshu Li

Electronic Version Approved:
Office of Graduate Studies, College of Arts and Sciences, Georgia State University, May 2007

ACKNOWLEDGEMENTS

I wish to take this opportunity to thank many people without whom this thesis would not have been accomplished. First and foremost, I would like to thank my thesis advisor, Dr. Yanqing Zhang. I was able to achieve this task with his help, guidance, encouragement, and the time that he has spent directing my thesis. I also wish to thank my thesis committee members, Dr. Saeid Belkasim and Dr. Yingshu Li, for taking time to evaluate my simulation results and to review my thesis document. I would like to express my gratitude to Dr. Hyunsoo Kim from Georgia Tech, who helped me understand PSSM profiling. I would like to thank my fellow department members and all my friends, who patiently listened to all my doubts and queries and helped me in many ways. Last but not least, I wish to express my gratitude to my parents and my brother, who have pushed me this far. I most certainly should thank my husband Ashraf, who supported me and encouraged me even when I kept bugging him.

Declaration

I hereby declare that, except where otherwise indicated, this document is entirely my own work and has not been submitted in whole or in part to any other university.

Signed: ... Date: ...

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1. INTRODUCTION
2. PROTEIN SECONDARY STRUCTURE PREDICTION
   2.1. PROTEINS
   2.2. PROTEIN STRUCTURE
   2.3. PROTEIN STRUCTURE PREDICTION
   2.4. PROTEIN SECONDARY STRUCTURE PREDICTION
3. SUPPORT VECTOR MACHINES
   3.1. OVERVIEW OF SVM
   3.2. LINEAR CLASSIFICATION [18]
   3.3. SUPPORT VECTOR ALGORITHM
   3.4. SVM SOFTWARE
4. RELATED SVM-BASED METHODS
   4.1. SECONDARY STRUCTURE ASSIGNMENT
   4.2. TRAINING AND TESTING DATA SETS
   4.3. DATA PRE-PROCESSING
   4.4. PARAMETER OPTIMIZATION OF THE BINARY CLASSIFIER
   4.5. BINARY CLASSIFIER CONSTRUCTION
   4.6. TERTIARY CLASSIFIER DESIGN
5. NEW SVM-BASED METHODS
   5.1. BASIC CONCEPTS
   5.2. MULTIPLE WINDOWS: NEW ENCODING SCHEME
   5.3. GENETIC NEURAL NETWORK USING SVM NEURONS: SVM_GA+NN
   5.4. SVM_COMPLETE: NEW TERTIARY CLASSIFIER OF THIS STUDY
6. SIMULATION RESULTS
   6.1. ACCURACY MEASURE OF SVM
   6.2. SINGLE WINDOW VS. MULTIPLE WINDOWS
   6.3. TERTIARY CLASSIFIERS
   6.4. DIFFERENT ENCODING SCHEME
7. CONCLUSIONS AND FUTURE WORKS
   7.1. CONCLUSIONS
   7.2. FUTURE WORKS
REFERENCES
APPENDIX

LIST OF TABLES

Table 2-1 Performance comparison of various prediction methods [32]
Table 4-1 8-to-3 state reduction method in secondary structure assignment
Table 6-1 Comparing Single Window and Multiple Windows
Table 6-2 Simulation II: Single Window vs. Multiple Windows
Table 6-3 Accuracy of tertiary classifiers on the RS 126 data set. Combined results of 7-fold cross validation are shown
Table 6-4 Q3 (%) after 7-fold Cross Validation for Window of size ...
Table 6-5 Q3 (%) after 7-fold Cross Validation for Window of size ...
Table 6-6 Q3 (%) after 7-fold Cross Validation for Window of size ...
Table 6-7 Accuracy of Tertiary Classifier Using Multiple Windows Scheme
Table 6-8 Accuracy levels of Tertiary Classifiers

LIST OF FIGURES

Figure 2-1 Primary and Secondary structure of protein [31]
Figure 2-2 Three typical secondary structure elements [31]
Figure 2-3 Tertiary and Quaternary structure of protein [31]
Figure 2-4 Application of Protein Secondary Structure Prediction
Figure 3-1 A separating hyperplane for a 2-d training set [18]
Figure 3-2 Linear separating hyperplane for non-separable case
Figure 3-3 Transformation to higher dimensional space
Figure 4-1 RS 126 data sets
Figure 4-2 Sliding window scheme with window length of 5
Figure 4-3 Sample Training Data with Orthogonal Encoding of Window Size 5
Figure 4-4 An Example of Orthogonal Vector Profile
Figure 4-5 Typical kernel Functions in SVM
Figure 4-6 Tree based tertiary classifiers
Figure 4-7 DAG scheme
Figure 5-1 Multiple Window Scheme
Figure 5-2 Novel Neural Network Using SVM Neurons
Figure 6-1 Comparing Single Window and Multiple Windows
Figure 6-2 Simulation II: Single Window vs. Multiple Windows
Figure 6-3 Accuracy of tertiary classifiers on the RS 126 dataset
Figure 6-4 Showing Q3, QH, QE and QC % of Different Classifiers Using Multiple Windows Scheme

LIST OF ABBREVIATIONS

BLOSUM: Block Substitution Matrix
DAG: Directed Acyclic Graph
DSSP: Database of Secondary Structure in Proteins
ERM: Empirical Risk Minimization
OSH: Optimal Separating Hyperplane
PSSM: Position Specific Scoring Matrix
RBF: Radial Basis Function
SOV: Segment Overlap Measure
SRM: Structural Risk Minimization
SVM: Support Vector Machines

1. INTRODUCTION

Protein secondary structure is closely related to the protein tertiary structure, which determines the characteristic behavior of the protein. Much research has been done over the decades to study and predict protein structure. To date, the total number of experimentally determined structures is less than twenty thousand (Protein Data Bank), whereas there are over a million known protein sequences. It is therefore becoming increasingly important to predict protein structure from its amino acid sequence, using insight obtained from already known structures.

Neural networks and support vector machines are the generally adopted techniques for predicting protein secondary structure. The SVM method is a comparatively new learning system which has mostly been used in pattern recognition problems. This machine uses a hypothesis space of linear functions in a high-dimensional feature space, and it is trained with a learning algorithm based on optimization theory.

To compare the results of this study with previous results, the RS 126 data set [1], proposed by Rost and Sander, is used. Among neural network approaches, Chandonia and Karplus [2] introduced a novel method for processing and decoding the protein sequence with NNs by using a large training data set of 681 non-homologous proteins; with the use of a jury method, this scheme recorded 74.8% accuracy.

Some of the recent studies adopting the SVM learning machine for secondary structure prediction are one which used frequency profiles with evolutionary information as an encoding scheme for SVM [3], one which adopted two layers of SVM with a weighted cost function [4], and one which applied PSI-BLAST PSSM profiles [5] as an input vector and a sliding

window scheme with an SVM_Representative architecture [6]. All of them show good prediction accuracies, over 70%.

A support vector machine is a learning system that uses a high dimensional feature space, trained with a learning algorithm from optimization theory. Since SVM has many advantageous features, including effective avoidance of over-fitting, the ability to manage large feature spaces, and information condensing of the given data, it has been gradually applied to pattern classification problems in biology [7].

In this study, the single sliding window scheme used to form the data for the binary classifier is challenged with a multiple windows scheme. Both binary classifiers show more or less the same accuracies: in some cases the single window is better, and in some the multiple windows. Since the optimal window size was already determined by Hu's method [6], the same is used here.

Then two novel tertiary architectures are introduced. Both architectures give better accuracy when compared to the method proposed by Hu [6]. In Hu's method the tertiary classifier uses only three of the six binary classifiers. In the proposed methods of this paper, all six binary classifiers are applied to form the tertiary classifier, so as to give completeness to the results. Both single window and multiple window schemes were tested to get the best results.

Based on the performance comparison of binary classifiers with previous studies, the optimized encoding scheme of this study, the combined matrix of orthogonal and BLOSUM62 matrix, was not satisfactory. However, with the use of the new tertiary classifiers of

this study, the performance was enhanced. This has been demonstrated by simulation results on the 7-fold data sets compared with Hu's method.

2. PROTEIN SECONDARY STRUCTURE PREDICTION

2.1. Proteins

Proteins are one of the most basic components in all organisms. They form the basis of cellular and molecular life and significantly affect the structural and functional characteristics of the various cells and genes. On the smallest level, proteins are made up of linear sequences of twenty natural amino acids joined together by peptide bonds. This is known as the primary structure of a protein. This sequence determines the next levels of structure that are formed. A change in a single acid in a critical area of a protein can alter its biological function. The secondary structure of a protein is the folding or coiling of the peptide chain. The tertiary structure is the three dimensional shape of the polypeptide chain. The quaternary structure is the final dimensional structure formed by all the polypeptide chains making up a protein.

Knowing amino acid sequences is important for several reasons [8]. First, knowledge of the sequence of a protein is usually essential to elucidating its mechanism of action, such as the catalytic mechanism of an enzyme. Moreover, proteins with novel properties can be generated by varying the sequence of known proteins. Second, amino acid sequences determine the three dimensional structures of proteins. The amino acid sequence is the link between the genetic message in DNA and the three-dimensional structure that performs a protein's biological function. Analyses of relations between amino acid sequences and 3-D structures of proteins are uncovering the rules that govern the folding of polypeptide chains.

Third, sequence determination is a component of molecular pathology, a rapidly growing area of medicine. Alterations in amino acid sequence can produce abnormal function and disease. Severe and sometimes fatal diseases, such as sickle-cell anemia and cystic fibrosis, can result from a change in a single amino acid within a protein. Fourth, the sequence of a protein reveals much about its evolutionary history. Proteins resemble one another in amino acid sequence only if they have a common ancestor. Consequently, molecular events in evolution can be traced from amino acid sequences.

2.2. Protein Structure

Proteins are an important class of biological macromolecules present in all biological organisms, made up of such elements as carbon, hydrogen, nitrogen, oxygen, and sulfur. All proteins are polymers of amino acids. The polymers, also known as peptides, consist of a sequence of 20 different L-α-amino acids, also referred to as residues. For chains under 40 residues the term peptide is frequently used instead of protein. To be able to perform their biological function, proteins fold into one, or more, specific spatial conformations driven by a number of non-covalent interactions. In order to understand the functions of proteins at a molecular level, it is often necessary to determine the three dimensional structure of proteins.

A protein has a structural hierarchy containing primary, secondary, tertiary and quaternary levels. The primary structure defines the linear sequence of assembled amino acids. At each end of the protein there remains a free amino or carboxyl group; these are referred to as the N and C termini, which represent the unbounded N and C atoms in the amino or carboxyl group respectively.

Figure 2-1 Primary and Secondary structure of protein [31].

The tertiary structure comprises the compact globular structures called domains, which are determined and stabilized by chemical bonds and forces, including weak bonds such as hydrogen bonds, ionic bonds, van der Waals bonds and hydrophobic interactions. The domain is the fundamental unit of tertiary structure, since it forms an independent stable structure and is the basic unit of protein function.

The quaternary structure is the final level of the structural hierarchy. Although some proteins are monomeric, consisting of only one polypeptide chain, others are multimeric, constructed from several chains. These subunits may work cooperatively, and the functional state of one subunit can depend on the state of the other units. In Figure 2.3, examples of tertiary and quaternary structures are shown.

Figure 2-2 Three typical secondary structure elements [31].

Figure 2-3 Tertiary and Quaternary structure of protein [31].

2.3. Protein Structure Prediction

To be able to find the tertiary structure of a protein, reliable prediction of its secondary structure is important, as the latter leads to the prediction of the former. Protein structure can also be used to infer bio-chemical and biological functional information, and in identification of amino acids that are involved in the active site.

The functional properties of proteins depend upon their three dimensional structures. To understand the biological functions of proteins, the structure of a protein from the amino acid sequence should be known beforehand. Therefore, knowledge of the tertiary structure of a protein is a prerequisite for the proper engineering of its function. A predicted model of a protein can also serve as an excellent basis for identifying amino acid residues involved in the active site, and the model can be used for protein engineering, drug design or immunological studies. With this approach, drugs can be developed to cure specific diseases such as sickle cell disease, Parkinson's disease, Alzheimer's disease and many other inherited metabolic disorders.

The current experimental methods for determining the secondary or tertiary structure of proteins are X-ray crystallography, nuclear magnetic resonance (NMR) and electron microscopy [8]. However, each method has its limitations: in addition to not being able to characterize most of the transient or stable complexes that exist in a cell, they are also expensive, laborious and time consuming, taking months or even years to complete.

20 9 Besdes the applcaton of the three dmensonal structure predctons, secondary structure predctons have other advantageous applcatons. For example, predcted secondary structure can be used to nfer bo-chemcal and bologcal functonal nformaton, and n dentfcaton of regons of the proten that wll lkely to undergo structural changes. Fgure 2.4 shows the possble applcatons of proten secondary structure predcton. There are four man approaches of secondary structure predcton and those are Emprcal Statstcal Methods Nearest Neghbor Method Hdden Markov Model Methods Machne learnng Methods Three-Dmensonal Structure Predcton Proten Classfcaton Secondary Structure Predcton Proten Functon Predcton Proten Conformatonal Swtch Predcton Fgure 2-4 Applcaton of Proten Secondary Structure Predcton.

2.4.1. Empirical Statistical Methods

The empirical statistical methods of protein structure prediction are among the first programs to predict protein structure from amino acid sequence. Examples of these empirical methods are the Chou-Fasman method and the GOR method of secondary structure prediction, which use values obtained from known protein structures, whose number at the time of development of each method was quite small.

The Chou-Fasman method, developed in 1974, was the first algorithm designed for predicting the secondary structure of globular proteins [10]. This method assigns each individual amino acid a value based on its frequency of being in an α helix or β-strand, and categorizes the amino acids into six groups based on frequency for each of the two secondary structure elements. A query sequence is then scanned for regions where three of five amino acids have a high probability of being in a β-strand, or four of six amino acids have a high probability of being in an α helix, and the region is extended in either direction until the prediction value drops below a specific value. After assigning α helix and β-sheet to the appropriate regions of the sequence according to the rules of this method, the remaining regions are analyzed for coils using a more complex set of rules; a sketch of the helix nucleation-and-extension logic is given below.

A second statistical method is that of Garnier, Osguthorpe and Robson (GOR) [11]. This method calculates probability values for a specific amino acid based on the adjacent amino acids up to eight residues away, using principles of information theory. The GOR method was developed in 1978 and has been updated many times since then, the most recent version based on the current database of 513 domains recommended by Cuff and Barton [12].
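The helix nucleation-and-extension scan described above can be illustrated in code. The Python sketch below is hypothetical: the propensity values cover only a handful of residues and are placeholder numbers rather than the published Chou-Fasman table, and the β-strand, turn and coil rules are omitted.

# Minimal sketch of Chou-Fasman-style helix nucleation and extension.
# Propensity values are illustrative placeholders, not the published table.
HELIX_PROPENSITY = {"A": 1.42, "E": 1.51, "L": 1.21, "K": 1.16,
                    "G": 0.57, "P": 0.57, "N": 0.67, "S": 0.77}

def helix_regions(seq, nucleus=6, formers=4, cutoff=1.0):
    """Return (start, end) spans predicted as helix: nucleate where at least
    4 of 6 residues are helix formers, then extend while the average
    propensity of the trailing 4 residues stays above the cutoff."""
    regions = []
    for i in range(len(seq) - nucleus + 1):
        window = seq[i:i + nucleus]
        if sum(HELIX_PROPENSITY.get(r, 1.0) > 1.0 for r in window) >= formers:
            end = i + nucleus
            while end < len(seq) and sum(
                HELIX_PROPENSITY.get(r, 1.0) for r in seq[end - 3:end + 1]
            ) / 4.0 >= cutoff:
                end += 1
            regions.append((i, end))
    return regions

print(helix_regions("AEKLLAGNPSAEKL"))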

The accuracy of these statistical methods was greatly improved in 1993 when Rost and Sander included multiple sequence profiles in their secondary structure prediction programs [1]. These algorithms assume that highly similar, evolutionarily related sequences share at least some secondary structure elements, particularly at conserved sites. Recently, the program PSI-BLAST [5], which can identify more distant relationships, has been used to widen the profile by including more evolutionarily related matches and increase accuracy [13].

2.4.2. Nearest Neighbor Methods

Nearest neighbor based methods differ from other approaches in that they predict the secondary structure of a target protein using local sequence similarity to segments of known proteins, usually through a sliding window, even when the overall target protein sequence differs substantially from the reference proteins. This approach benefits from the availability of numerous similarity matches or from several highly identical matches of known structures. Currently the two most popular nearest neighbor prediction servers are NNSSP [14] and Predator [15]. The NNSSP server, which includes evolutionary information through multiple sequence alignments, gives 72.2% correct predictions. Predator [15] adopts local pair-wise alignment of the target sequence, as opposed to multiple sequence alignment. The carefully selected alignment is derived from known structures.

2.4.3. Methods Based on Hidden Markov Models

Hidden Markov Models have also been used to predict secondary structure, using the same concept as the nearest neighbor technique. Once a multiple sequence alignment profile is built using short segments of similar sequences with known structure, hidden Markov models are generated in a structural context and then used to predict the structure of the unknown protein [16]. The program HMMSTR, developed by Bystroff et al., uses this method, claiming 74.3% accuracy [17].

2.4.4. Neural Network Methods

This method is based on the operation of synaptic connections in neurons of the brain, where input is processed in several levels and mapped to a final output. The neural network is often trained to map specific input signals to a desired output. In the secondary structure prediction of neural networks, the input is a sliding window over the residue sequence. Information from the central amino acid of each input window is modified by a weighting factor, summed and sent to a second level, termed the hidden layer, where the signal is transformed into a number close to either 1 or 0, and then sent to three output units representing each of the possible secondary structures [16]. The output units each weigh the signals again, sum them, and transform them into either a 1 or a 0. An output signal close to 1 for a secondary structure unit indicates that the secondary structure of that unit is predicted, and an output signal close to 0 indicates that it is not predicted.

Neural networks are trained by adjusting the values of the weights that modify the signals, using a training set of sequences with known structure. The neural network algorithm adjusts the

weight values until the program has been optimized to correctly predict the most residues in the training set. Typical neural network algorithms include the PHD program developed by Rost and Sander [1], which is trained on a set of profiles of multiple sequences and claims 72.2% accuracy. Recently, this accuracy was shown to increase to 75% when larger databases and PSI-BLAST [5] were used to create the training set [14].

2.4.5. Summary of Prediction Methods

In Table 2.1, the performance of various prediction techniques is compared. In some cases, the verified performance turned out to be lower than the performances reported by the authors. One reason for the difference in the results could be the different sets used for training and testing the networks.

Table 2-1 Performance comparison of various prediction methods [32].

Type                   Method                       Year   Generation*   Q3% Claimed   Q3% Verified
Statistical            Chou and Fasman              1974   First         ...           ...
Statistical            Garnier et al. (GOR)         ...    First         ...           ...
Neural Network         Qian and Sejnowski           ...    Second        ...           ...
Neural Network         Rost and Sander (PHD)        ...    Third         ...           ...
Neural Network         Pollastri et al. (SSpro)     2002   Third         ...           ...
Nearest Neighbor       Sal. & Solovyev (NNSSP)      ...    Third         ...           ...
Nearest Neighbor       Frish. & Argos (Predator)    ...    Third         ...           ...
Hidden Markov Models   Karplus et al. (SAM-T99)     ...    Third         ...           ...
Hidden Markov Models   Bystroff et al. (HMMSTR)     ...    Third         ...           ...

Note: *Generation (Rost 2000):
First: only single residue statistics used.
Second: sliding windows and large databases applied.
Third: long range interaction through evolutionary information introduced.

3. SUPPORT VECTOR MACHINES

3.1. Overview of SVM

Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. This learning strategy, introduced by Vapnik and co-workers, is a principled and very powerful method that in the few years since its introduction has already outperformed most other systems in a wide variety of applications.

In supervised learning, the learning machine is given a training set of examples (or inputs) with associated labels (or output values). Usually the examples are in the form of attribute vectors, so that the input space is a subset of $\mathbb{R}^n$. Once the attribute vectors are available, a number of sets of hypotheses could be chosen for the problem. Among these, linear functions are best understood and simplest to apply. Traditional statistics and the classical neural networks literature have developed many methods for discriminating between two classes of instances using linear functions.

The problem of learning from data has been investigated by philosophers throughout history under the name of inductive inference. Although this might seem surprising today, it was not until the 20th century that pure induction was recognized as impossible unless one assumes some prior knowledge. The development of learning algorithms became an important subfield of artificial intelligence, eventually forming the separate subject area of machine learning.

Kernel representations offer an alternative solution by projecting the data into a high dimensional feature space to increase the computational power of linear learning machines.

Another attraction of kernel methods is that the learning algorithms and theory can largely be decoupled from the specifics of the application area, which must simply be encoded into the design of an appropriate kernel function. Hence the problem of choosing an architecture for a neural network application is replaced by the problem of choosing a suitable kernel for a Support Vector Machine. The introduction of kernels greatly increases the expressive power of the learning machines while retaining the underlying linearity that ensures that learning remains tractable. The increased flexibility, however, increases the risk of over-fitting, as the choice of separating hyperplane becomes increasingly ill-posed due to the number of degrees of freedom.

Successfully controlling the increased flexibility of kernel-induced feature spaces requires a sophisticated theory of generalization, which is able to precisely describe which factors have to be controlled in the learning machines in order to guarantee good generalization. There is a remarkable family of bounds governing the relation between the capacity of a learning machine and its performance. The theory grew out of considering under what circumstances, and how quickly, the mean of some empirical quantities converges uniformly, as the number of data points increases, to the true mean [7].

Since the SVM approach has a number of superior features, such as effective avoidance of over-fitting, the ability to handle large feature spaces, and information condensing of the given data set, it has been successfully applied to a wide range of pattern recognition problems, including isolated handwritten digit recognition, object recognition, speaker identification, and text categorization [3].

3.2. Linear Classification [18]

A binary classifier is frequently implemented by using a real-valued function $f : X \subseteq \mathbb{R}^n \to \mathbb{R}$ in the following way: the input $x = (x_1, \ldots, x_n)'$ is assigned to the positive class if $f(x) \geq 0$, and otherwise to the negative class. We consider the case where $f(x)$ is a linear function of $x \in X$, so that it can be written as

$$f(x) = \langle w \cdot x \rangle + b = \sum_{i=1}^{n} w_i x_i + b$$

where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ are the parameters that control the function, and the decision rule is given by $\mathrm{sgn}(f(x))$. These parameters must be learned from the data.

If we interpret this hypothesis geometrically, the input space $X$ is split into two parts by the hyperplane defined by the equation $\langle w \cdot x \rangle + b = 0$. For example, in Figure 3.1, the hyperplane is the dark line, with the positive region above and the negative region below. The vector $w$ defines a direction perpendicular to the hyperplane, while varying the value of $b$ moves the hyperplane parallel to itself. These quantities are referred to as the weight vector and bias, terms borrowed from the neural networks literature.
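As a concrete illustration of the decision rule $\mathrm{sgn}(f(x))$, here is a minimal Python sketch (NumPy assumed; the weight vector and bias are arbitrary example values rather than learned parameters):

import numpy as np

def linear_classify(x, w, b):
    """Assign +1 if f(x) = <w, x> + b >= 0, else -1 (the sign decision rule)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Example: a hyperplane in R^2 with normal w and offset b.
w = np.array([1.0, -2.0])   # direction perpendicular to the hyperplane
b = 0.5                     # varying b shifts the hyperplane parallel to itself
print(linear_classify(np.array([3.0, 1.0]), w, b))   # +1 (positive side)
print(linear_classify(np.array([0.0, 2.0]), w, b))   # -1 (negative side)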

Figure 3-1 A separating hyperplane for a 2-d training set [18].

3.3. Support Vector Algorithm

3.3.1. Linear SVM - Maximum Margin Classifier

The training data are defined as a set $\{x_i, y_i\}$, $i = 1, \ldots, l$, $y_i \in \{-1, 1\}$, $x_i \in \mathbb{R}^d$. Suppose we have some hyperplane which separates the positive from the negative examples (a separating hyperplane). The points $x$ which lie on the hyperplane satisfy $w \cdot x + b = 0$, where $w$ is normal to the hyperplane, $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of $w$. Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the margin of a separating hyperplane to be $d_+ + d_-$. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints [18]:

$$x_i \cdot w + b \geq +1 \quad \text{for } y_i = +1 \tag{3.1}$$

$$x_i \cdot w + b \leq -1 \quad \text{for } y_i = -1 \tag{3.2}$$

These can be combined into one set of inequalities:

$$y_i (x_i \cdot w + b) - 1 \geq 0 \quad \text{for all } i \tag{3.3}$$

Now consider the points for which the equality in Eq. (3.1) holds (requiring that there exists such a point is equivalent to choosing a scale for $w$ and $b$). These points lie on the hyperplane $H_1: x_i \cdot w + b = 1$ with normal $w$ and perpendicular distance from the origin $|1 - b| / \|w\|$. Similarly, the points for which the equality in Eq. (3.2) holds lie on the hyperplane $H_2: x_i \cdot w + b = -1$, with normal again $w$, and perpendicular distance from the origin $|-1 - b| / \|w\|$. Hence $d_+ = d_- = 1/\|w\|$ and the margin is simply $2/\|w\|$. Note that $H_1$ and $H_2$ are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing $\|w\|^2$, subject to the constraints (3.3). Thus we expect the solution for a typical two dimensional case to have the form shown in Figure 3.1. Those training points for which the equality in Eq. (3.3) holds (i.e. those which wind up lying on one of the hyperplanes $H_1$, $H_2$), and whose removal would change the solution found, are called support vectors; they are indicated in Figure 3.1 by extra circles.

Now a Lagrangian formulation of the problem is given. There are two reasons for doing this. The first is that the constraints (3.3) will be replaced by constraints on the Lagrange multipliers themselves, which will be much easier to handle [19]. The second is that in this reformulation of the problem, the training data will only appear (in the actual training and test algorithms) in the form of dot products between vectors. This is a crucial property which will allow us to generalize the procedure to the nonlinear case.

Thus, we introduce positive Lagrange multipliers $\alpha_i$, $i = 1, \ldots, l$, one for each of the inequality constraints (3.3). Recall that the rule is that for constraints of the form $c_i \geq 0$, the constraint equations are multiplied by positive Lagrange multipliers and subtracted from the objective function to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives the Lagrangian:

$$L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i \tag{3.4}$$

We must now minimize $L_P$ with respect to $w$, $b$, and simultaneously require that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish, all subject to the constraints $\alpha_i \geq 0$ (let's call this particular set of constraints $C_1$). Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of $N$ simultaneous linear constraints defines the intersection of $N$ convex sets, which is also a convex set). This means that we can equivalently solve the following dual problem: maximize $L_P$, subject to the constraint that the gradient of $L_P$ with respect to $w$ and $b$ vanishes, and subject also to the constraint that $\alpha_i \geq 0$ (let's call that particular set of constraints $C_2$). This particular dual formulation of the problem is called the Wolfe dual. It has the property that the maximum of $L_P$, subject to constraints $C_2$, occurs at the same values of $w$, $b$ and $\alpha$ as the minimum of $L_P$, subject to constraints $C_1$ [19].

Requiring that the gradient of $L_P$ with respect to $w$ and $b$ vanish gives the conditions:

$$w = \sum_i \alpha_i y_i x_i \tag{3.5}$$

$$\sum_i \alpha_i y_i = 0 \tag{3.6}$$

Since these are equality constraints in the dual formulation, we can substitute them into Eq. (3.4) to give

$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \tag{3.7}$$

Note that we have now given the Lagrangians different labels (P for primal, D for dual) to emphasize that the two formulations are different: $L_P$ and $L_D$ arise from the same objective function but with different constraints; and the solution is found by minimizing $L_P$ or by maximizing $L_D$. Note also that if we formulate the problem with $b = 0$, which amounts to requiring that all hyperplanes contain the origin, the constraint (3.6) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one.

Support vector training (for the separable linear case) therefore amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to constraint (3.6) and positivity of the $\alpha_i$, with the solution given by (3.5). Notice that there is a Lagrange multiplier $\alpha_i$ for every training point. In the solution, those points for which $\alpha_i > 0$ are called support vectors, and lie on one of the hyperplanes $H_1$, $H_2$. All other training points have $\alpha_i = 0$ and lie either on $H_1$ or $H_2$ (such that the equality in Eq. (3.3) holds), or on the side of $H_1$ or $H_2$ such that the strict inequality in Eq. (3.3) holds. For these

machines, support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross $H_1$ or $H_2$), and the training was repeated, the same separating hyperplane would be found.

3.3.2. The Non-Separable Case

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints (3.1) and (3.2), but only when necessary; that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables $\xi_i$, $i = 1, \ldots, l$ in the constraints, which then become:

$$x_i \cdot w + b \geq +1 - \xi_i \quad \text{for } y_i = +1 \tag{3.8}$$

$$x_i \cdot w + b \leq -1 + \xi_i \quad \text{for } y_i = -1 \tag{3.9}$$

Thus, for an error to occur, the corresponding $\xi_i$ must exceed unity, so $\sum_i \xi_i$ is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from $\|w\|^2/2$ to $\|w\|^2/2 + C(\sum_i \xi_i)^k$, where $C$ is a parameter to be chosen by the user, a larger $C$ corresponding to assigning a higher penalty to errors [19]. As it stands, this is a convex programming problem for any positive integer $k$; for $k = 2$ and $k = 1$ it is also a quadratic programming problem, and the choice $k = 1$ has the further advantage that neither the $\xi_i$ nor their Lagrange multipliers appear in the Wolfe dual problem, which becomes:

Maximize:

$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \tag{3.10}$$

Subject to:

$$0 \leq \alpha_i \leq C \tag{3.11}$$

$$\sum_i \alpha_i y_i = 0 \tag{3.12}$$

The solution is again given by

$$w = \sum_{i=1}^{N_S} \alpha_i y_i x_i \tag{3.13}$$

where $N_S$ is the number of support vectors. Thus the only difference from the optimal hyperplane case is that the $\alpha_i$ now have an upper bound of $C$. The situation is summarized schematically in Figure 3.2.

Figure 3-2 Linear separating hyperplane for non-separable case.

We will need the Karush-Kuhn-Tucker conditions for the primal problem. The primal Lagrangian is

$$L_P = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \{ y_i (x_i \cdot w + b) - 1 + \xi_i \} - \sum_i \mu_i \xi_i \tag{3.14}$$

where the $\mu_i$ are the Lagrange multipliers introduced to enforce positivity of the $\xi_i$. The KKT conditions for the primal problem are therefore (note $i$ runs from 1 to the number of training points, and $\nu$ from 1 to the dimension of the data):

$$\frac{\partial L_P}{\partial w_\nu} = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0 \tag{3.15}$$

$$\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0 \tag{3.16}$$

$$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \tag{3.17}$$

$$y_i (x_i \cdot w + b) - 1 + \xi_i \geq 0 \tag{3.18}$$

$$\xi_i \geq 0 \tag{3.19}$$

$$\alpha_i \geq 0 \tag{3.20}$$

$$\mu_i \geq 0 \tag{3.21}$$

$$\alpha_i \{ y_i (x_i \cdot w + b) - 1 + \xi_i \} = 0 \tag{3.22}$$

$$\mu_i \xi_i = 0 \tag{3.23}$$

As before, we can use the KKT complementarity conditions, Eqs. (3.22) and (3.23), to determine the threshold $b$. Note that Eq. (3.17) together with Eq. (3.23) shows that $\xi_i = 0$ if $\alpha_i < C$. Thus we can simply take any point for which $0 < \alpha_i < C$ and use Eq. (3.22) (with $\xi_i = 0$) to compute $b$ [18].

3.3.3. Nonlinear SVM - Kernel Method

The soft margin classifier is an extension of the linear SVM. The kernel method is a scheme for finding nonlinear boundaries. The concept of the kernel method is the transformation of the vector space to a higher dimensional space. As can be seen from Figure 3.3, by transforming the vector space from two-dimensional to three-dimensional space, the non-separable vectors can be separated.
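The threshold recipe above (take any point with $0 < \alpha_i < C$ and solve Eq. (3.22) with $\xi_i = 0$) translates directly into code. A minimal NumPy sketch, assuming the multipliers $\alpha$ have already been produced by a quadratic programming solver:

import numpy as np

def compute_bias(alpha, X, y, C, tol=1e-8):
    """Recover b from KKT condition (3.22): for 0 < alpha_i < C we have
    xi_i = 0, so y_i (w.x_i + b) = 1 and b = y_i - w.x_i. Averaging over
    all such margin support vectors is the usual numerically stable variant."""
    w = (alpha * y) @ X                        # Eq. (3.5): w = sum_i alpha_i y_i x_i
    margin_svs = (alpha > tol) & (alpha < C - tol)
    return np.mean(y[margin_svs] - X[margin_svs] @ w)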

Figure 3-3 Transformation to higher dimensional space.

3.4. SVM Software

The prediction of protein secondary structure is done using the SVM_light software. SVM_light is an implementation of Vapnik's Support Vector Machine (Vapnik 1995) for the problems of pattern recognition, regression and ranking. The software consists of two parts: the first part, svm_learn, takes care of the learning module, and the second part, svm_classify, does the classification of the data after training. The input data to both parts should be given in the following format:

<line>    .=. <target> <feature>:<value> <feature>:<value> ...
<target>  .=. <float>
<feature> .=. <integer> | "qid"
<value>   .=. <float>

The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs must be ordered by increasing feature number. Features with value zero can

be skipped. For classification, the target value denotes the class of the example: +1 as the target value marks a positive example, and -1 a negative example, respectively. So, for example, the line

-1 1:0.43 3:0.12 2345:0.9

specifies a negative example for which feature number 1 has value 0.43, feature number 3 has value 0.12, feature number 2345 has value 0.9, and all the other features have value 0. The order of the predictions is the same as in the training data [20].
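Writing training and test files in this format is mechanical; the following small Python helper (a sketch, not part of SVM_light itself) reproduces the example line above:

def to_svmlight_line(target, features):
    """Encode one example: target, then feature:value pairs sorted by
    increasing feature number, with zero-valued features skipped."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()) if v != 0)
    return f"{target} {pairs}"

# The negative example from the text: features 1, 3 and 2345 are non-zero.
print(to_svmlight_line(-1, {1: 0.43, 3: 0.12, 2345: 0.9}))
# -> -1 1:0.43 3:0.12 2345:0.9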

4. RELATED SVM-BASED METHODS

4.1. Secondary Structure Assignment

The secondary structure is converted from the experimentally determined tertiary structure by DSSP [9], STRIDE [15] or DEFINE. In this study, the DSSP scheme is used, since it is the most generally used secondary structure assignment method. DSSP classifies residues into eight different secondary structure classes: H (α-helix), G (3_10 helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and - (rest). In this study, these eight classes are reduced into three regular classes based on the following Table 4-1.

Table 4-1 8-to-3 state reduction method in secondary structure assignment

DSSP class (8-state symbol)                            3-state symbol   Class name
α-helix (H), 3_10 helix (G), π-helix (I)               H                Helix
β-strand (E)                                           E                Sheet
isolated β-bridge (B), bend (S), turn (T), rest (-)    C                Loop
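The 8-to-3 reduction of Table 4-1 is a plain table lookup; a minimal Python sketch:

# DSSP 8-state to 3-state reduction used in this study (Table 4-1).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",            # alpha-, 3_10- and pi-helix -> Helix
    "E": "E",                                # beta-strand -> Sheet
    "B": "C", "S": "C", "T": "C", "-": "C",  # bridge, bend, turn, rest -> Loop
}

def reduce_states(dssp_string):
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in dssp_string)

print(reduce_states("HGIEBST-"))   # -> HHHECCCC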

4.2. Training and Testing Data Sets

For comparing the results of this study with previously published results [6], the RS 126 data set is used. The RS 126 data set was proposed by Rost & Sander [1] and, according to their definition, it is a non-homologous set. They used percentage identity to measure homology and define non-homologous to mean that no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues.

For each data set, seven-fold cross validation is done [1] [3] [6]. In the seven-fold cross validation test, one subset is chosen for testing and the remaining 6 subsets are used for training, and this process is repeated until all the subsets have been chosen for testing. The RS 126 set is displayed in Figure 4.1.

Figure 4-1 RS 126 data sets.
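A minimal Python sketch of the seven-fold protocol (the chain names are placeholders standing in for the 126 RS126 proteins; the actual fold assignment used in the study may differ):

def seven_fold_splits(proteins, k=7):
    """Yield (train, test) partitions: each subset is tested once while
    the remaining six subsets are used for training."""
    folds = [proteins[i::k] for i in range(k)]   # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        yield train, test

chains = [f"protein_{n}" for n in range(126)]    # stand-in for the RS126 set
for train, test in seven_fold_splits(chains):
    assert len(train) + len(test) == 126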

4.3. Data Pre-Processing

4.3.1. Sliding Window Scheme

To train the SVM with protein sequence and structural information, a sliding window scheme is used [6]. In this sliding scheme, a window becomes one training pattern for predicting the structure of the residue at the center of the window. In this training pattern, the information about the local interactions among neighboring residues is embedded. Figure 4.2 shows an example of this scheme with a window size of 5. Here, to predict the structure of amino acid N, the sequence AKNLK goes together as one input pattern. To predict the structure of L, the next amino acid, the window slides down to the next group of the sequence, KNLKQ, and so on.

Figure 4-2 Sliding window scheme with window length of 5, over the protein sequence NATAAKNLKQDATKSERVA with secondary structure HHHHHHHCECCHHHCCHHH.
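The sliding window construction can be written compactly. A minimal Python sketch that reproduces the AKNLK / KNLKQ example from Figure 4-2:

def sliding_windows(sequence, structure, size=5):
    """Pair each window with the secondary-structure label of its center residue."""
    half = size // 2
    for i in range(half, len(sequence) - half):
        yield sequence[i - half:i + half + 1], structure[i]

seq = "NATAAKNLKQDATKSERVA"
sst = "HHHHHHHCECCHHHCCHHH"
for window, label in list(sliding_windows(seq, sst))[4:6]:
    print(window, "->", label)   # AKNLK -> H, KNLKQ -> C (center residues N, L)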

4.3.2. Orthogonal Input Profile

The feature value of each amino acid residue in a window gives the weight (cost) of that residue in a pattern. Not all weight assignment schemes are tested in this study; the test results from Hu's method are used to select the best scheme. In Hu's study the orthogonal method is used as the reference for comparison with different encoding schemes. Among the different schemes explained in the previous studies, the simplest is the orthogonal encoding, which assigns a unique binary vector to each residue, such as (1, 0, 0, ...), (0, 1, 0, ...), (0, 0, 1, ...) and so on [6]. In this method, the weights of all the residues in a window are assigned to 1 equally.

The method is explained as follows; for simplicity, only the single window encoding scheme is explained. For multiple windows the same technique is followed for all the elements within each window, and the target is the center element of the middle window. Figure 4.3 shows the sample training data to which the orthogonal encoding method is applied as an input profile with window size 5. In Figure 4.3, the values of the first column, {-1, +1}, are the target values of each binary classifier. For example, if the binary classifier is the one which classifies helix or not, and if the structure of the residue from the training data is helix (i.e. the center position has H, G or I), the target value becomes +1. After the target value, the indices of each amino acid follow, each with its weight. In the first row of Figure 4.3, 16, 37, 41, 61 and 84 are the indices of each amino acid when the window size equals 5, and the values of 1 next to these are the weights of each amino acid. These weights are all equal in the orthogonal encoding scheme.

Since the 20 amino acids take one binary bit each, as shown in Figure 4.3, the dimension of one input pattern becomes:

one vector dimension = (20 binary bits) × (window size of 5 residues) = 100

Therefore, the indices 16, 37, 41, 61 and 84 mean a 1 at position 16 of the 100-dimensional vector for S, at position 37 for T, at position 41 for A, at position 61 for A, and at position 84 for D, with zeros everywhere else.

Figure 4-3 Sample Training Data with Orthogonal Encoding of Window Size 5.
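The index arithmetic behind Figure 4-3 (20 positions per residue in the ARNDCQEGHILKMFPSTWYV ordering, window position times 20 plus the residue's 1-based rank) is captured by this short sketch, which reproduces the indices 16, 37, 41, 61 and 84 for the window S T A A D:

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # ordering used by the orthogonal profile

def orthogonal_indices(window):
    """1-based sparse indices of the one-hot encoding of a residue window."""
    return [pos * 20 + AMINO_ACIDS.index(res) + 1
            for pos, res in enumerate(window)]

print(orthogonal_indices("STAAD"))   # -> [16, 37, 41, 61, 84]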

Figure 4-4 An Example of Orthogonal Vector Profile (the amino acid ordering ARNDCQEGHILKMFPSTWYV applied over the protein sequence S T A A D Q M ...).

Other profiling techniques, such as the Physico-Chemical Property Based Input Profiles 1 and 2, the Hydrophobicity Matrix Input Profile and the BLOSUM62 Matrix Input Profile, were tested in Hu's study [6]. Among them, only the one using the BLOSUM62 matrix was shown to have better accuracy than the orthogonal encoding scheme, so in this study only the BLOSUM62 Matrix Input Profile is used. Combinations of two profiles in a hybrid encoding scheme were also discussed in Hu's study; among them, the combination of the orthogonal encoding scheme with the BLOSUM62 matrix was shown to have the best results. This paper deals mainly with the above mentioned encoding scheme.

4.3.3. BLOSUM62 Matrix Input Profile

The BLOSUM matrices originate from the paper by Henikoff and Henikoff (1992) [21]. Their idea was to find a good measure of difference between two proteins, specifically for more distantly related proteins. The values in the BLOSUM62 matrix are log-odds scores for the likelihood that a given amino acid pair will interchange. Amino acids with similar physical properties are more likely to replace one another than dissimilar amino acids. So a conservative hydrophobic exchange such as I (Ile) to L (Leu) has a positive score, whereas changing I to the hydrophilic residue N (Asn) receives a negative score, which means that this kind of interchange is not likely to occur [6]. More detailed information about the BLOSUM62 matrix is given in the Appendix.

In this research, instead of using the orthogonal vector, the BLOSUM62 matrix is applied as an input profile. But since the range of the BLOSUM62 matrix values is [-4, 11], for proper encoding of the SVM the range should be converted to [0, 1] in the preprocessing step. For this conversion, the following two functions are applied and compared. Function (4.1) is a simple linear function with range conversion from [-4, 11] to [0, 1], and Function (4.2) is an exponential function. As the BLOSUM62 matrix is a log odds matrix with base 2, the exponential function (4.2) is designed accordingly; if function (4.2) is applied, the values of the BLOSUM62 matrix over 6 are assigned to 1 regardless of the final value.

$$f(x) = \frac{x + 4}{15} \tag{4.1}$$

$$f(x) = \min\left(2^{\,x - 6},\, 1\right) \tag{4.2}$$

where $x$ is the value from the raw profile matrix.
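A Python sketch of the two conversions, assuming the linear map sends [-4, 11] onto [0, 1] and the exponential is base 2 and saturates at 1 for scores of 6 and above, as stated above (the exact published constants may differ):

def linear_convert(x):
    """Linear range conversion from [-4, 11] to [0, 1], as in Eq. (4.1)."""
    return (x + 4) / 15.0

def exp_convert(x):
    """Base-2 exponential conversion in the spirit of Eq. (4.2): BLOSUM62
    scores of 6 or more are assigned 1 regardless of the final value."""
    return min(2.0 ** (x - 6), 1.0)

for score in (-4, 0, 6, 11):
    print(score, round(linear_convert(score), 3), round(exp_convert(score), 4))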

4.3.4. Combined Input Profile

To obtain the optimal input profile, which offers the most informative features for predicting the secondary structure with high accuracy, the previous input profiles are combined together. When more than one encoding scheme is used, the weight is applied based on a position inside a window. In other words, even though each amino acid has 20 different log odds scores, those values are always the same regardless of the position inside the sliding window. Therefore, by assigning different weights based on their position inside the window, the machine can be trained with more specific information.

4.4. Parameter Optimization of the Binary Classifier

To achieve high testing accuracy, a suitable kernel function, its parameters and the regularization parameter C should be properly selected. In Figure 4.5, typical kernel functions are shown. Draper [14] compared the kernels for protein secondary structure prediction with SVM and concluded that the polynomial and Gaussian kernels, the latter also known as the RBF kernel, were the best.

Hua and Sun [3] have shown that the Gaussian kernel can provide superior performance in generalization ability and convergence speed. Therefore, in this study, according to the previous result, the Gaussian radial basis function kernel was adopted. Once the kernel function is selected, the parameter of the kernel function, γ, and the regularization parameter C, which controls the trade-off between complexity and misclassified training examples, should be specified by the user. The soft-margin SVM maximizes the margin while minimizing the training error simultaneously, by solving the following optimization problem:

$$\min_{w,\, b,\, \xi} \quad \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} \xi_i$$

$$\text{subject to} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, l$$

where $x_i$ represents an input vector, $y_i = +1$ or $-1$ according to whether $x_i$ is in the positive class or the negative class, $l$ is the number of training data, and $C$ is the regularization parameter that controls the trade-off between the margin and the classification error represented by the slack variables $\xi_i$. The corresponding dual quadratic programming problem, with the application of a kernel function $K(x_i, x_j)$, is written as:

$$\max_{\lambda} \quad \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j K(x_i, x_j)$$

$$\text{subject to} \quad 0 \leq \lambda_i \leq C, \qquad \sum_i \lambda_i y_i = 0$$

The dual formulation of the soft margin SVM with regularization parameter C shows that the influence of a single training example is limited by C, and it is known that a large C value imposes a high penalty on classification errors.

For the proper choice of the C value and the γ value of the Gaussian radial basis function, $e^{-\gamma \|x - y\|^2}$, the previous studies tested different γ's and upper bound values of C over their own data sets and selected the pairs which show the best accuracy [22] [4] [3].

Simple dot product: $K(x, y) = x \cdot y$
Vovk's polynomial: $K(x, y) = (x \cdot y + 1)^p$
Radial basis function (RBF): $K(x, y) = e^{-\gamma \|x - y\|^2}$
Two layer neural network: $K(x, y) = \tanh(k\, x \cdot y - \delta)$

Figure 4-5 Typical kernel Functions in SVM.
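In practice this selection is a small grid search over γ and C. The following sketch uses scikit-learn's SVC and GridSearchCV on synthetic stand-in data; it illustrates the procedure but is not the SVM_light-based setup actually used in this study, and the grid values are arbitrary examples:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 100))            # stand-in for encoded windows (5 residues x 20 bits)
y = rng.integers(0, 2, 200) * 2 - 1   # stand-in binary labels, +1 / -1

# RBF kernel exp(-gamma * ||x - y||^2); C bounds each multiplier, trading
# margin width against training error.
search = GridSearchCV(SVC(kernel="rbf"),
                      {"gamma": [0.01, 0.1, 1.0], "C": [0.5, 1.0, 2.0]},
                      cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))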

4.5. Binary Classifier Construction

Six SVM binary classifiers were constructed: three one-versus-rest classifiers ("one": positive class, "rest": negative class) named H/~H, E/~E and C/~C, and three one-versus-one classifiers named H/E, E/C and C/H. For example, the classifier H/E is constructed on the training samples having helices and sheets, and it classifies a testing sample as helix or sheet. The programs for constructing the SVM binary classifiers were written in the C language.

4.6. Tertiary Classifier Design

There are many ways to combine the output of the binary classifiers for secondary structure prediction. In this research, several tertiary classifiers proposed by previous studies [3] [22] were tested and compared with the new tertiary classifier of this study. Here, the new tertiary classifier is designed based on the results of the three one-versus-one binary classifiers.

4.6.1. Tree-based tertiary classifier [3]

This method is based on the three one-versus-rest binary classifiers (H/~H, E/~E and C/~C) and the three one-versus-one classifiers (H/E, E/C and C/H). With these classifiers, three cascade tertiary classifiers, TREE_HEC (H/~H, E/C), TREE_ECH (E/~E, C/H) and TREE_CHE (C/~C, H/E), were made. These tree-based classifiers are shown in Figure 4.6.

Figure 4-6 Tree-based tertiary classifiers: (a) TREE_HEC applies H/~H first and, on a negative decision, E/C; (b) TREE_CHE applies C/~C first and then H/E; (c) TREE_ECH applies E/~E first and then C/H. In each tree a positive decision value (> 0) selects the first-named class.
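The TREE_HEC cascade of Figure 4-6(a) reduces to two nested decisions on the raw SVM decision values; a minimal sketch (the two inputs are assumed to be the distances returned by the trained H/~H and E/C binary classifiers):

def tree_hec(d_h_rest, d_e_c):
    """TREE_HEC cascade: H/~H first; on a 'not helix' decision, E/C decides
    sheet vs. coil. Positive decision values mean the first-named class."""
    if d_h_rest > 0:
        return "H"
    return "E" if d_e_c > 0 else "C"

print(tree_hec(1.3, -0.2))   # -> H
print(tree_hec(-0.7, 0.4))   # -> E
print(tree_hec(-0.7, -0.4))  # -> C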

4.6.2. Simple voting tertiary classifier (SVM_VOTE) [3]

In this method, all six binary classifiers are combined by using a simple voting scheme in which the testing sample is predicted to be state i (i is among H, E and C) if the largest number of the six binary classifiers classify it as state i. In case a testing sample receives two classifications for each state, it is considered to be a coil.

4.6.3. SVM_MAX_D [3]

In this classifier, the three one-versus-rest classifiers (H/~H, E/~E, C/~C) are combined for handling the multi-class case, and the class of a testing sample (H, E or C) is assigned to the one which presents the largest positive distance from the optimal separating hyperplane (OSH). For example, if the distance values of the one-versus-rest classifiers (H/~H, E/~E, C/~C) are -1.7, 1.2 and 2.5 respectively, then, since the negative distance of the H/~H binary classifier gives no information for the decision, only the two positive values (1.2, 2.5) are compared. Finally, the class for the test sample is assigned to coil, because 2.5 is the larger of the two values; a code sketch of this rule is given after the next subsection.

4.6.4. Directed Acyclic Graph (DAG) based tertiary classifier [22]

As shown in Figure 4.7, this classifier is based on the three one-versus-one classifiers (H/E, E/C and C/H). Many test results show that one-versus-one classifiers are more accurate than one-versus-rest classifiers, due to the fact that the one-versus-rest scheme often needs to handle two data sets with very different sizes, i.e. unbalanced training data [23] [24].
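The SVM_MAX_D rule is one line of logic once the three one-versus-rest distances are available; this sketch reproduces the worked example from Section 4.6.3 (when all three distances happen to be negative, the least negative is returned here, a case the description above leaves open):

def svm_max_d(d_h, d_e, d_c):
    """Assign the class whose one-versus-rest classifier gives the largest
    distance from the optimal separating hyperplane; negative distances
    carry no information, so the maximum effectively compares positives."""
    distances = {"H": d_h, "E": d_e, "C": d_c}
    return max(distances, key=distances.get)

print(svm_max_d(-1.7, 1.2, 2.5))   # -> C, since 2.5 is the largest positive value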


Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES

VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES UbCC 2011, Volume 6, 5002981-x manuscrpts OPEN ACCES UbCC Journal ISSN 1992-8424 www.ubcc.org VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Taxonomy of Large Margin Principle Algorithms for Ordinal Regression Problems

Taxonomy of Large Margin Principle Algorithms for Ordinal Regression Problems Taxonomy of Large Margn Prncple Algorthms for Ordnal Regresson Problems Amnon Shashua Computer Scence Department Stanford Unversty Stanford, CA 94305 emal: shashua@cs.stanford.edu Anat Levn School of Computer

More information

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions Applcaton of Maxmum Entropy Markov Models on the Proten Secondary Structure Predctons Yohan Km Department of Chemstry and Bochemstry Unversty of Calforna, San Dego La Jolla, CA 92093 ykm@ucsd.edu Abstract

More information

SUMMARY... I TABLE OF CONTENTS...II INTRODUCTION...

SUMMARY... I TABLE OF CONTENTS...II INTRODUCTION... Summary A follow-the-leader robot system s mplemented usng Dscrete-Event Supervsory Control methods. The system conssts of three robots, a leader and two followers. The dea s to get the two followers to

More information

Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction

Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction Georga State Unversty ScholarWorks @ Georga State Unversty Computer Scence Dssertatons Department of Computer Scence 8-2-2006 Clusterng System and Clusterng Support Vector Machne for Local Proten Structure

More information

General Vector Machine. Hong Zhao Department of Physics, Xiamen University

General Vector Machine. Hong Zhao Department of Physics, Xiamen University General Vector Machne Hong Zhao (zhaoh@xmu.edu.cn) Department of Physcs, Xamen Unversty The support vector machne (SVM) s an mportant class of learnng machnes for functon approach, pattern recognton, and

More information

Discriminative classifiers for object classification. Last time

Discriminative classifiers for object classification. Last time Dscrmnatve classfers for object classfcaton Thursday, Nov 12 Krsten Grauman UT Austn Last tme Supervsed classfcaton Loss and rsk, kbayes rule Skn color detecton example Sldng ndo detecton Classfers, boostng

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Artificial Intelligence (AI) methods are concerned with. Artificial Intelligence Techniques for Steam Generator Modelling

Artificial Intelligence (AI) methods are concerned with. Artificial Intelligence Techniques for Steam Generator Modelling Artfcal Intellgence Technques for Steam Generator Modellng Sarah Wrght and Tshldz Marwala Abstract Ths paper nvestgates the use of dfferent Artfcal Intellgence methods to predct the values of several contnuous

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

UNIT 2 : INEQUALITIES AND CONVEX SETS

UNIT 2 : INEQUALITIES AND CONVEX SETS UNT 2 : NEQUALTES AND CONVEX SETS ' Structure 2. ntroducton Objectves, nequaltes and ther Graphs Convex Sets and ther Geometry Noton of Convex Sets Extreme Ponts of Convex Set Hyper Planes and Half Spaces

More information

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science EECS 730 Introducton to Bonformatcs Sequence Algnment Luke Huan Electrcal Engneerng and Computer Scence http://people.eecs.ku.edu/~huan/ HMM Π s a set of states Transton Probabltes a kl Pr( l 1 k Probablty

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Face Recognition Based on SVM and 2DPCA

Face Recognition Based on SVM and 2DPCA Vol. 4, o. 3, September, 2011 Face Recognton Based on SVM and 2DPCA Tha Hoang Le, Len Bu Faculty of Informaton Technology, HCMC Unversty of Scence Faculty of Informaton Scences and Engneerng, Unversty

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Optimization Methods: Integer Programming Integer Linear Programming 1. Module 7 Lecture Notes 1. Integer Linear Programming

Optimization Methods: Integer Programming Integer Linear Programming 1. Module 7 Lecture Notes 1. Integer Linear Programming Optzaton Methods: Integer Prograng Integer Lnear Prograng Module Lecture Notes Integer Lnear Prograng Introducton In all the prevous lectures n lnear prograng dscussed so far, the desgn varables consdered

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Disulfide Bonding Pattern Prediction Using Support Vector Machine with Parameters Tuned by Multiple Trajectory Search

Disulfide Bonding Pattern Prediction Using Support Vector Machine with Parameters Tuned by Multiple Trajectory Search Proceedngs of the 9th WSEAS Internatonal Conference on APPLIED IFORMAICS AD COMMUICAIOS (AIC '9) Dsulfde Bondng Pattern Predcton Usng Support Vector Machne wth Parameters uned by Multple rajectory Search

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

Network Intrusion Detection Based on PSO-SVM

Network Intrusion Detection Based on PSO-SVM TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2860-2866 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A selectve ensemble classfcaton method on mcroarray

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

CLASSIFICATION OF ULTRASONIC SIGNALS

CLASSIFICATION OF ULTRASONIC SIGNALS The 8 th Internatonal Conference of the Slovenan Socety for Non-Destructve Testng»Applcaton of Contemporary Non-Destructve Testng n Engneerng«September -3, 5, Portorož, Slovena, pp. 7-33 CLASSIFICATION

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information