Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction


Georgia State University
Computer Science Dissertations, Department of Computer Science

Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction

Wei Zhong

Follow this and additional works at:

Recommended Citation
Zhong, Wei, "Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction." Dissertation, Georgia State University, 2006.

This Dissertation is brought to you for free and open access by the Department of Computer Science at Georgia State University. It has been accepted for inclusion in Computer Science Dissertations by an authorized administrator of Georgia State University. For more information, please contact scholarworks@gsu.edu.

CLUSTERING SYSTEM AND CLUSTERING SUPPORT VECTOR MACHINE FOR LOCAL PROTEIN STRUCTURE PREDICTION

by Wei Zhong

Under the Direction of Yi Pan

ABSTRACT

Protein tertiary structure plays a very important role in determining a protein's possible functional sites and chemical interactions with other related proteins. Experimental methods to determine protein structure are time consuming and expensive. As a result, the gap between protein sequence and structure has widened substantially due to high throughput sequencing techniques. The problems of experimental methods motivate us to develop computational algorithms for protein structure prediction. In this work, a clustering system is used to predict local protein structure. At first, recurring sequence clusters are explored with an improved K-means clustering algorithm. Carefully constructed sequence clusters are used to predict local protein structure. After obtaining the sequence clusters and motifs, we study how the sequence variation of a sequence cluster may influence its structural similarity.

Analysis of the relationship between sequence variation and structural similarity for sequence clusters shows that sequence clusters with tight sequence variation have high structural similarity, while sequence clusters with wide sequence variation have poor structural similarity. Based on this knowledge, the established clustering system is used to predict the tertiary structure of local sequence segments. Test results indicate that the highest quality clusters give highly reliable prediction results and high quality clusters give reliable prediction results. In order to improve the performance of the clustering system for local protein structure prediction, a novel computational model called Clustering Support Vector Machines (CSVMs) is proposed. In our previous work, the sequence-to-structure relationship was explored with the conventional K-means algorithm. The K-means clustering algorithm may not capture the nonlinear sequence-to-structure relationship effectively. As a result, we consider using Support Vector Machines (SVMs) to capture the nonlinear sequence-to-structure relationship. However, SVM training is not favorable for huge datasets including millions of samples. Therefore, we propose a novel computational model called CSVMs. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. Compared with the clustering system introduced previously, our experimental results show that accuracy for local structure prediction is improved noticeably when CSVMs are applied.

INDEX WORDS: K-means clustering algorithm, PISCES (Protein Sequence Culling Server), HSSP (Homology-Derived Secondary Structure of Proteins), sequence motif, hydrophobicity index, evolutionary distance, PDB (Protein Data Bank), SVM (Support Vector Machine), protein structure prediction, granular computing.

CLUSTERING SYSTEM AND CLUSTERING SUPPORT VECTOR MACHINE FOR LOCAL PROTEIN STRUCTURE PREDICTION

by

WEI ZHONG

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the College of Arts and Sciences, Georgia State University, 2006

Copyright by Wei Zhong 2006

CLUSTERING SYSTEM AND CLUSTERING SUPPORT VECTOR MACHINE FOR LOCAL PROTEIN STRUCTURE PREDICTION

by

WEI ZHONG

Major Professor: Yi Pan
Committee: Phang C. Tai, Robert Harrison, Martin D. Fraser

Electronic Version Approved:

Office of Graduate Studies
College of Arts and Sciences
Georgia State University
August 2006

ACKNOWLEDGEMENTS

The dissertation would not have been possible without the help of so many people. I would like to take this opportunity to express my deep appreciation to all those who helped me in this hard but extremely rewarding process. First, I would like to thank my advisor, Professor Yi Pan, for all his help, advice, support, guidance, and patience. Whenever I have difficulties and problems, Dr. Pan always encourages me to make harder efforts so that I can overcome those problems. Dr. Pan has given me much valuable advice about how to conduct research and choose my career. Without his help, I could not have made rapid progress in my research and dissertation writing. I am grateful to Dr. Robert Harrison, Dr. Phang C. Tai, and Dr. Martin D. Fraser for serving on my Ph.D. committee, and for their time and cooperation in reviewing this work. Dr. Harrison provided me with insights into how to cluster the sequence segments and how to express my ideas clearly in my papers. I would like to thank Dr. Fraser for giving me many valuable suggestions about statistical techniques for SVM learning. I would like to thank Dr. Tai for providing me with a great deal of biological knowledge so that I can combine computational methods with biological experiments smoothly. I would like to thank Professor Roland L. Dunbrack for providing the data set from PISCES. This research was supported in part by the U.S. National Institutes of Health under grants R01 GM S1 and P20 GM A1, and the U.S. National Science Foundation under grants ECS and ECS . I am also supported by a Georgia State University Molecular Basis of Disease Program Fellowship.

I would like to thank Dr. Raj Sunderraman for his great patience and support during my job search and Ph.D. study. The more important thanks are reserved for the last. Many thanks to my parents, Zhong Shunlong and Wei Shufang, for their constant support, concern, and motivation. My parents gave me much important advice about how to lead a successful life during my Ph.D. study.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS

Chapter 1 Introduction
  Research Motivations and Contributions
    Local Protein Structure Prediction
    Clustering System for Local Protein Structure Prediction
    Clustering Support Vector Machine for Local Protein Structure Prediction
  Dissertation Organization

Chapter 2 Protein Structure Prediction
  Protein Structure Representations and Protein Structure Determination
  Comparative Homology Modeling
  Threading or Fold Recognition
  Ab Initio Methods

Chapter 3 Discovery of Sequence Clusters and Sequence Motifs with Improved K-means Algorithms
  Several Major Motif Discovery Methods
  K-means Clustering Algorithms
    Traditional K-means Clustering Algorithm
    New Greedy Initialization Method for the K-means Algorithm
  Experiment Setup
    Experimental Parameters
    Data Set Generation and Representation of Sequence Segments
    Evolutionary Distance and Cluster Membership Calculation for Sequence Segments
    Secondary Structure Assignment
    Measure of Structural Similarity for a Given Cluster
    Evaluation of Performance for the Clustering Algorithm and Generation of Frequency Profiles for Sequence Motifs
  Experimental Results
    Comparison of Performance for the Traditional and Improved K-means Algorithm
    Sequence Motifs
    Result Comparison with Other Research

Chapter 4 Parallel K-means Algorithm using Pthread and OpenMP over Hyper-Threading Technology
  Parallelization
  Hyper-Threading Technology
  Pthread and OpenMP
  Programming Environment and Implementation Details
  Comparing Pthread and OpenMP Implementations

Chapter 5 Relationship between Sequence Variation and Corresponding Structural Similarity for Sequence Clusters and Sequence Motifs

  5.1 Previous Studies for Sequence and Structural Variation of Sequence Clusters
  Experimental Setup
    Recurrent Clustering Data Set
    Clustering of Sequence Segments in the Sequence Space
    Generation of Frequency Profile for Sequence Clusters
    Evaluation of Distribution of Amino Acid for Each Position of Frequency Profile
  Measure of Sequence Variation for a Given Sequence Cluster
    Measure of Sequence Variation by Average of Relative Entropy Values for All Positions of Sequence Profiles
    Measure of Sequence Variation by the Number of Important Positions for Sequence Profiles
  Measure of Secondary Structure Similarity for a Given Sequence Cluster
  Measure of Tertiary Structural Variation by dmrmsd_sc for a Given Sequence Cluster
    Average Distance Matrix (ADM) among Sequence Segments for a Given Sequence Cluster
    dmrmsd_sc for a Given Sequence Cluster
  Results Analysis

Chapter 6 Local Protein Structure Prediction by the Clustering System
  Data Set and Sequence Segment Generation
  Training Set and Independent Test Set
  Clustering of Sequence Segments Belonging to the Training Set
  Representative Structure for Each Cluster
    Representative Secondary Structure
    Average Distance Matrix (ADM)
    Representative Torsion Angle
  Local Structure Prediction by the Clustering System
    Distance Score of a Given Sequence Segment for Each Cluster
    Reliability Score of Each Cluster for a Given Sequence Segment
    Structure Prediction by Distance Score and Reliability Score for a Given Sequence Segment
  Prediction Accuracy Calculation
    Secondary Structure Accuracy
    Distance Matrix Root Mean Square Deviation (dmrmsd)
    Torsion Angle RMSD (tarmsd)
  Classification of Clusters into Different Groups
  Experimental Results

Chapter 7 Support Vector Machine
  Optimal Hyperplane for Separable Case
    Optimization Problem to Build Optimal Hyperplane
    Some Properties of Hyperplane and One Algorithm to Build Optimal Hyperplane
  Optimal Hyperplane for Nonseparable Sets
    Δ-margin Separating Hyperplanes
    Soft Margin Generalization
  Expected Risk Bounds for Optimal Hyperplane

  7.4 Mercer's Theorem to Deal with High Dimensionality
    Fundamental Concept of SVM
    Mercer's Theorem for High Dimensionality
  Construction of SVM
    Constructing SVM with Quadratic Optimization
    Constructing SVM using Linear Optimization Method
  SVM Kernels
    Selection of SV Machine Using Bounds
    Polynomial Functions
    Radial Basis Functions
    Two-layer Neural Networks
  Multiclass Classification

Chapter 8 Implementation of SVM for a Very Large Dataset
  First Class of Algorithms for Large Dataset
    Decomposition Algorithm
    Sequential Minimal Optimization (SMO)
    Boosting Algorithm to Scale up SVM
  Second Class of Algorithms for Large Dataset Training
    Random Selection
    Active Learning with SVM
    Classifying Large Datasets using SVM with Hierarchical Clusters

Chapter 9 Clustering Support Vector Machines for Protein Local Structure Prediction
  Review of Previous Work
  Method
    Granular Computing
    K-means Clustering Algorithm as the Granulation Method
    Generation of Sequence Segments by the Sliding Window Method
    Distance Score and Reliability Score of a Given Sequence Segment
    Cluster Membership Assignment for Each Sequence Segment
    Support Vector Machine
    Clustering Support Vector Machines (CSVMs)
    Advantages of CSVMs
    Training CSVMs for Each Cluster
    Local Protein Structure Prediction by CSVMs
  Experimental Setup
    Training Set and Independent Test Set
    Prediction Accuracy Calculation for Each Sequence Segment
    Classification of Clusters into Different Groups
  Experimental Results and Analysis
    Average Accuracy, Precision and Recall of CSVMs for Different Cluster Groups
    Comparison of Independent Prediction Accuracy for Different Cluster Groups in Terms of Three Metrics between the Clustering Algorithm and the CSVM Model
    Comparison of Accuracy Criteria One and Accuracy Criteria Two between the Clustering System and the CSVMs Model
  Summary

Chapter 10 Conclusions and Future Work

Bibliography

LIST OF FIGURES

Figure 1. Two Physical Processors and Four Logical Processors
Figure 2. Four Physical Processors Behaving Like Eight Logical Processors
Figure 3. Implementation Details of Five Pthreads
Figure 4. Pthread Code
Figure 5. OpenMP Code
Figure 6. Speedup Values for Pthread and OpenMP
Figure 7. Relationship between Variability and the Relative Entropy for Each Position of Sequence Profiles for Sequence Clusters
Figure 8. Percentages of Sequence Clusters with the Specified Number of Important Positions in the Specified Ranges of Secondary Structure Similarity
Figure 9. Comparison of the Important Positions between the Percentage of Clusters with dmrmsd_sc > 2.0 Å and the Percentage of Clusters with dmrmsd_sc < 1.5 Å
Figure 10. Secondary Structure Accuracy for the Clustering System
Figure 11. dmrmsd for the Clustering System
Figure 12. tarmsd for the Clustering System
Figure 13. Accuracy Criteria One for the Clustering System
Figure 14. Accuracy Criteria Two for the Clustering System
Figure 15. The CSVMs Model
Figure 16. Comparison of Accuracy, Precision and Recall of CSVMs
Figure 17. Comparison of Secondary Structure Accuracy between the Clustering System and CSVMs Model
Figure 18. Comparison of dmrmsd between the Clustering System and CSVMs Model
Figure 19. Comparison of tarmsd between the Clustering System and CSVMs Model
Figure 20. Comparison of Accuracy Criteria One between the Clustering System and the CSVMs Model for Different Cluster Groups
Figure 21. Comparison of Accuracy Criteria Two between the Clustering System and the CSVMs Model for Different Cluster Groups

LIST OF TABLES

Table 1. Comparison of the Percentage of Sequence Segments Belonging to Clusters with High Structural Similarity
Table 2. Comparison of the Number of Clusters with High Structural Similarity
Table 3. Standard to Classify Clusters into Different Groups
Table 4. The Threshold for Evaluating Accuracy Criteria One and Accuracy Criteria Two for Each Cluster

LIST OF ACRONYMS

CSVMs: Clustering Support Vector Machines
SVM: Support Vector Machine
NMR: Nuclear Magnetic Resonance
BLAST: The Basic Local Alignment Search Tool
PROSPECT: Protein Structure Prediction and Evaluation Computer Toolkit
CASP: Critical Assessment of Techniques for Protein Structure Prediction
PSI-BLAST: Position Specific Iterative BLAST
PSSM: Position Specific Scoring Matrix
SCOP: Structural Classification of Proteins
RMSD: Root Mean Squared Deviation
HMM: Hidden Markov Models
PISCES: Protein Sequence Culling Server
HSSP: Homology-derived Secondary Structure of Proteins
PDB: Protein Data Bank
SMT: Simultaneous Multithreading
ADM: Average Distance Matrix
SMO: Sequential Minimal Optimization
dmrmsd: Distance Matrix Root Mean Square Deviation
tarmsd: Torsion Angle Root Mean Square Deviation

Chapter 1 Introduction

1.1 Research Motivations and Contributions

Local Protein Structure Prediction

Proteins are polymers of amino acids connected by the formation of covalent peptide bonds. Proteins have four levels of structure: primary structure, secondary structure, tertiary structure and quaternary structure. Based on hydrogen bonding interactions between adjacent amino acid residues, the polypeptide chain can arrange itself into secondary structure. The polypeptide chains of protein molecules fold into the native structure. Multiple interacting polypeptide chains of characteristic tertiary structure develop into protein quaternary structure. Protein structure can be determined experimentally by X-ray crystallography, Nuclear Magnetic Resonance (NMR) and electron microscopy. When X-ray crystallography is applied, crystallisation of proteins is a very difficult task. Compared to X-ray crystallography, experiments related to NMR are carried out in solution rather than in a crystal lattice. However, NMR is only applicable to determining the structures of small and medium-sized molecules due to limitations of the principles that make NMR possible. Knowledge about protein functions can be used to infer how the protein interacts with other molecules. Protein functions are largely determined by their structures. As a result, understanding protein structures is a very important task. Determination of protein structure by experimental methods is a long and tedious process. The difficulties of determining protein

structures experimentally require us to predict protein structures using computational methods. Comparative homology modeling, threading, and ab initio methods are the three major methods for protein structure prediction. The classification of these three major methods is based on how each method utilizes the resources available in current databases. Comparative homology modeling produces the best prediction results so far. Tertiary structure and function are highly conserved during the evolutionary process. As a result, protein sequences with high sequence similarity usually share similar structures. The prediction accuracy of homology modeling depends on whether protein sequences in the Protein Data Bank with high sequence similarity to the target protein sequences can be found. Sequence alignment algorithms are used to find protein sequences sharing high similarity with target sequences whose structures are to be predicted. Based on sequence alignment algorithms, the aligned residues of the structure templates from protein sequences sharing high similarity with target sequences are used to construct the structural model. In this process, the quality of the sequence alignment algorithm is the key factor determining whether suitable structural templates can be selected and how well the target protein can be aligned with the structural templates. For comparative homology modeling, local sequence alignment is used to find segments of protein sequences with high similarity. Local sequence alignment includes pairwise alignment and profile-based alignment. Profile-based methods perform much better than pairwise comparison methods such as the Basic Local Alignment Search Tool (BLAST) when sequence similarity is less than 30%. If sequence alignment algorithms cannot find correct folds for the target sequence, threading or fold recognition can be utilized to provide correct folds for the target sequence. Based on the concept that only a small number of distinct protein folds exist for protein families, a library

of representative local structures is scanned in order to find structural analogs to protein sequences. After the library is set up, an energy function is used to select suitable library entries to serve as templates for target sequences. The Protein Structure Prediction and Evaluation Computer Toolkit (PROSPECT) is one of the best threading programs in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition (Xu et al., 2001). Threading methods are computationally expensive because each entry of a library containing thousands of possible folds must be aligned in all possible ways. The energy functions used in threading methods are not sophisticated enough to always find the correct protein folds. Ab initio methods can be used to predict protein structures from sequence information alone when appropriate structure templates cannot be found. Most ab initio prediction methods restrict the conformation space to a reasonable size using a reduced protein representation and select energy functions related to the most important interactions responsible for protein folding into its native form.

Clustering System for Local Protein Structure Prediction

Recurring sequence motifs of proteins are explored with an improved K-means clustering algorithm. Information about local protein sequence motifs is very important to the analysis of biologically significant conserved regions of protein sequences. These conserved regions can potentially determine the diverse conformations and activities of proteins. Carefully constructed sequence motifs from sequence clusters are used to predict local protein structure. PROSITE, PRINTS and PFAM are popular methods to create sequence motifs. Since the sequence motifs and profiles of PROSITE, PRINTS and PFAM are developed from multiple sequence alignments, these sequence motifs and profiles only search conserved elements of

sequence alignments from the same protein family and carry little information about conserved sequence regions that transcend protein families. Furthermore, knowledge about the biologically important regions or residues is a precondition of finding these motifs. As a result, the discovery of sequence motifs and profiles requires intensive human intervention. While these methods for producing the popular sequence motifs require human intervention to explore the biologically significant regions of protein sequences, the clustering technique provides an automatic, unsupervised discovery process. All these advantages, in comparison with the methods used to create the popular sequence motifs, motivate us to develop an improved K-means clustering algorithm. Han and Baker have used the K-means clustering program to find recurring local sequence motifs for proteins (Han and Baker, 1995; Han and Baker, 1996). In their work, a set of initial points for cluster centers is chosen randomly (Han and Baker, 1995). Since the performance of K-means clustering is very sensitive to initial point selection (Jain, Murty and Flynn, 1999), their technique may not yield satisfactory results. To overcome the potential problems of random initialization, the new greedy initialization method tries to choose suitable initial points so that final partitions can represent the underlying distribution of the data samples more consistently and accurately (Zhong et al., 2004b). Each initial point is represented by one local sequence segment. In the new initialization method, the clustering algorithm is performed for only a few iterations during each run. After each run, initial points that can be used to form clusters with good structural similarity are chosen, and their evolutionary distance is checked against that of all points already selected in the initialization array. If the minimum evolutionary distance of a new point is greater than the specified distance, the point is added to the initialization array. Satisfaction of the minimum evolutionary distance guarantees that each

newly selected point is well separated from all the existing points in the initialization array and will potentially belong to a different natural cluster. This process is repeated several times until the specified number of points is chosen. After this procedure, these carefully selected points can be used as the initial centers for the K-means clustering algorithm. Analysis of the clustering process of the traditional clustering algorithm reveals that some of the initial points are very close to each other, creating strong interference with each other. Strong interference among initial points affects the final partitioning negatively. The results of our improved K-means algorithm show that the average percentage of sequence segments belonging to clusters with structural similarity greater than 60% steadily improves with increasing minimum evolutionary distances among initial points. This improved percentage results from decreased interference among initial points when the evolutionary distances among initial points are increased. Comparison between the sequence motifs obtained by both algorithms suggests that the improved K-means clustering algorithm may discover some relatively weak and subtle sequence motifs. These motifs are undetectable by the traditional K-means algorithm because random selection of points may choose two starting points that lie within one natural cluster. For example, some of the weak amphipathic helices and sheets discovered by the improved K-means algorithm have not been reported in the literature. In addition, the number of repeated substitution patterns of sequence motifs found by the traditional K-means algorithm is less than that of the improved K-means algorithm. Our results reveal much more detailed hydrophobicity patterns for helices, sheets and coils than the previous study (Han and Baker, 1995). These elaborate hydrophobicity patterns are supported by various biochemical experiments. Increased information about the hydrophobicity patterns associated with these sequence motifs can expand our knowledge of how proteins fold

and how proteins interact with each other. Furthermore, the analysis of discovered sequence motifs shows that some elaborate and subtle sequence patterns such as Patterns 1, 9 and 22 have never been reported in previous works. In particular, the increased number of repeated substitution patterns reported in this study may provide additional strong evidence for structurally conservative substitutions during the evolutionary process of protein families. The sequence motifs discovered in this study indicate conserved residues that are structurally and functionally important across protein families, because the protein sequences used in this study share less than 25% sequence identity. These important features of our sequence motifs may help to compensate for some of the weak points of those created by PROSITE, PRINTS, PFAM and BLOCKS (Attwood et al., 2002; Henikoff, Henikoff and Pietrokovski, 1999; Sonnhammer et al., 1998). Our sequence motifs may reflect general structural or functional characteristics shared by different protein families, while sequence motifs from PROSITE, PRINTS, PFAM and BLOCKS represent structural or functional constraints specific to a particular protein family. Due to high throughput sequencing techniques, the number of known protein sequences has increased rapidly in recent years. However, information about the functionally significant regions of these new proteins may not be available. As a result, the automatic discovery of biologically important sequence motifs in this study is a much more powerful tool for exploring the underlying correlations between protein sequences, structures and functions than other methods requiring guidance from existing scientific results. In our study, the cluster number of 800 is chosen empirically. However, 800 may not be the optimal cluster number. Therefore, the improved K-means algorithm will be run several times with different values of k in order to discover the most suitable number of clusters. With the information about the optimal cluster number, clustering results may be potentially closest to

the underlying distribution patterns of the sample space. However, the time spent searching for good initial points grows substantially when the minimum evolutionary distance and the structural similarity threshold are increased. For example, it takes 18 days to obtain appropriate initial points with a distance threshold of 1500 when the sample size is very large. Due to time and processing power constraints, the search for the optimal cluster number has not been completed. The long search time for initial points motivates us to implement a parallel K-means algorithm in order to reduce the search time for suitable initial points to one to two days. The parallelization of the improved K-means algorithm will make exploration of the optimal cluster number possible. We predict that the performance gains for the improved K-means algorithm will increase further after the optimal cluster number is found. As a result, Pthread and OpenMP are employed to parallelize the K-means clustering algorithm on the Hyper-Threading enabled Intel architecture. The speedup for 16 Pthreads is 4.3 and the speedup for 16 OpenMP threads is 4 on the 4-processor shared memory architecture. With the new parallel K-means algorithm, K-means clustering can be performed multiple times in a reasonable amount of time. Our research also shows that Hyper-Threading technology for the Intel architecture is efficient for this parallel biological algorithm. After proposing an improved K-means clustering algorithm to discover sequence clusters and sequence motifs automatically and implementing the parallel K-means clustering algorithm, we discuss how the sequence variation of a sequence cluster may influence its structural similarity. Analysis of the relationship between sequence variation and corresponding structural variation for sequence clusters is one of the open questions in protein structure and sequence analysis (Rahman and Zomaya, 2005). Some researchers have evaluated the structural variation of sequence clusters. Kasuya and Thornton (1999) and Jonassen et al.

(1999) have used crmsd to analyze structural variation for sequence motifs. Bystroff and Baker (1998) have used the K-means clustering algorithm to find sequence clusters and to assess structural variation for these sequence clusters. Bystroff and Baker incorporated structural information during the clustering process (1998). As a result, their final sequence clusters are contaminated by the usage of structural information during the clustering process. Our implementation of K-means clustering is significantly different from Bystroff's work (1998) because we only use recurrent clusters and do not include structural information in the clustering process. To the best of our knowledge, no researchers have conducted an in-depth analysis of the relationship between sequence variation and corresponding structural variation for sequence clusters (Zhong et al., 2005a). This work focuses on a systematic and detailed analysis of the relationship between sequence variation and corresponding structural variation for sequence clusters. Understanding this relationship is very important for improving the quality of local sequence alignment and low homology protein folding. Sequence clusters with tight sequence variation can be used to establish structural templates for low homology protein folding. The frequency profiles of sequence clusters with tight sequence variation can also be used to find sequence segments with similar local structure in local sequence alignment algorithms. Since the average of the relative entropy values over all positions of a frequency profile cannot determine the sequence variation for sequence clusters, we use the number of important positions to define the sequence variation for sequence clusters. If the relative entropy at a specified position of the frequency profile is greater than 0.2, this position is defined as an important position of the frequency profile. Our statistics indicate that an average of five amino acids occupy 60% of the frequency space if the relative entropy at that position of the frequency profile is

greater than 0.2. Statistically, each of the twenty amino acids may occur with a frequency of 5%. Therefore, five amino acids would be expected to occupy 25% of the frequency space. As a result, the distribution of amino acids is highly disproportionate at the important positions. The number of important positions is used to indicate the extent of sequence variation for sequence clusters. An increased number of important positions in the frequency profile reflects that more positions in the frequency profile have a highly disproportionate distribution of the 20 amino acids. As a result, the sequence variation for such sequence clusters is more compact. In contrast, a relatively small number of important positions indicates that the sequence variation for the sequence cluster is wide. Our results indicate that defining sequence variation for sequence clusters by the number of important positions is more effective in distinguishing sequence clusters with high structural variation from those with low structural variation. The sequence variation and structural variation of sequence clusters containing sequence segments of a specified length are analyzed separately. The length of sequence segments ranges from 5 to 15 in our study. Sequence clusters containing sequence segments of different lengths show a similar relationship between sequence variation and structural variation. Due to limitations of space, we focus on the sequence clusters containing sequence segments of length nine. All the results shown in the following are related to the sequence clusters containing sequence segments of length nine. Analysis of our results reveals that, on average, the number of important positions for clusters with low structural variation is greater than the number of important positions for clusters with high structural variation. Low structural variation for sequence clusters indicates that the structural variation is compact. A large number of important positions indicates that the sequence variation for sequence clusters is tight. In other words, our results indicate the important pattern that sequence

clusters with tight sequence variation tend to have tight structural variation and sequence clusters with wide sequence variation tend to have wide structural variation. After explaining the improved K-means algorithm for sequence motif discovery and how the sequence variation of a sequence cluster may influence its structural similarity, the clustering system is developed for local protein structure prediction. Our preliminary results show that sequence segments of length nine are long enough to have some structural features and short enough to have a statistically significant number of samples. It is clear that other segment lengths are important, and the analysis presented here can be applied to them as well. Due to the huge amount of computation, we plan to analyze sequence segments with lengths ranging from 5 to 15 in the next step. The average distance matrix, representative torsion angle and representative secondary structure form the representative structure of each cluster. The frequency profile of a given sequence segment is compared with the centroid of each cluster in order to calculate a distance score. A smaller distance score shows that the frequency profile of the given sequence segment is closer to the centroid of a given cluster. The reliability score of a given sequence segment for a cluster is determined by the sum of the frequencies of the matched amino acids at the corresponding positions of the average frequency profile of the cluster. The distance score of each cluster for a given sequence segment is calculated in order to filter out less significant clusters. If the difference between a cluster's distance score and the smallest distance score is within 100, the cluster is selected. Other clusters are discarded since they are less significant. The cluster with the highest reliability score among the selected clusters is finally used to predict the structure of the sequence segment. Our results indicate that clusters of high quality provide reliable prediction results and clusters of average

26 11 qualty produces hgh qualty results. Specal cauton need be taken aganst predcton results by the bad cluster group Clusterng Support Vector Machne for Local Proten Structure Predcton The central deas of support vector machnes are to map the nput space nto another hgher dmensonal feature space usng the kernels functon and to buld an optmal hyperplane n that feature space (Vapnk, 1998). One of mportant questons s that how we can buld the hyperplane that has strong generalzaton capablty n the hgh dmensonal feature space. The second queston s that how we can avod the curse of dmensonalty n ths hgh dmensonal feature space. The Mercer s Theorem helps us avod mappng the nput space nto another hgher dmensonal space explctly. Mercer s theorem ndcates that any kernel functon satsfyng Mercer s condton can calculate the nner product of two vectors n some hgh dmensonal Hlbert space. Based on Mercer theorem, the hgh-dmensonal feature space need not be consdered drectly durng the process of fndng the optmal hyperplane. Instead, the nner products between support vectors and the vectors n the feature space can be calculated. SVM has two layers. In the frst layer, nput vectors are mplctly transformed and each nner product between the nput vector and support vectors are calculated based on the kernel functon. In the second layer, the lnear decson functon s bult n the hgh dmensonal feature space. The best SV machne wth the smallest expected rsks has smallest VC dmenson. SVMs are based on the dea of mappng data ponts to a hgh dmensonal feature space where a separatng hyperplane can be found. SVMs are searchng the optmal separatng hyperplane by solvng a convex quadratc programmng (QP). The typcal runnng tme for the convex quadratc programmng s Ω (m 2 ) for the tranng set wth m samples. The convex quadratc programmng s NP-complete n the worst case (Vavass, 1991). Therefore, SVMs are not

27 12 favorable for a large dataset (Chang and Ln, 2001). Our dataset contans a half mllons samples. Expermental results show that tranng of SVM for a half mllons samples s not complete after one month on the poweredge6600 server wth four processors from Dell. Many algorthms and mplementaton technques have been developed to enhance SVMs n order to ncrease ther tranng performance wth large data sets. The most well-known technques nclude chunkng (Vapnk, 1998), Osuna s decomposton method (Osuna, Freund, and Gros, 1997), Sequental Mnmal Optmzaton (SMO) (Platt, 1999) and boostng algorthms (Pavlov, Mao and Dom, 2000). The success of these methods depends on dvdng the orgnal quadratc programmng (QP) problem nto a seres of smaller computatonal problems n order to reduce the sze of each QP problem. Although these algorthms accelerate the tranng process, these algorthms do not scale well wth the sze of the tranng data. The second class of algorthms tres to speed up the tranng process by reducng the number of tranng data. Snce some data ponts such as the support vectors are more mportant to determne the optmal soluton, these algorthms provde SVMs wth hgh qualty data ponts durng the tranng process. Random Selecton (Balcazar, Da and Watanabe, 2001) and clusterng analyss (Yu, Yang, and Han, 2000) are representatves of these algorthms. Ther algorthms are hghly scalable for the large data set whle the performance of tranng depends greatly on the selecton of tranng samples. In order to solve the problems related to large sample tranng, Clusterng Support Vector Machnes are proposed n ths work. Understandng proten sequence-to-structure relatonshp s one of the most mportant tasks of current bonformatcs research. The knowledge of correspondence between the proten sequence and ts structure can play very mportant role n proten structure predcton (Rahman and Zomaya, 2005). Han and Baker have used the K-means

28 13 clusterng algorthm to explore proten sequence-to-structure relatonshp. Proten sequences are represented wth frequency profles. Wth the K-means clusterng algorthm, hgh qualty sequence clusters have been produced (Han and Baker, 1996). They have used these hgh qualty sequence clusters to predct the backbone torson angles for local proten structure (Bystroff and Baker, 1998). In ther work and our prevous works, the K-means clusterng algorthm s essental to understand how proten sequences correspond to local 3D proten structures. However, the conventonal clusterng algorthms such as the K-means and K-nearest neghbor algorthm assume that the dstance between data ponts can be calculated wth exact precson. When ths dstance functon s not well characterzed, the clusterng algorthm may not reveal the sequence-to-structure relatonshp effectvely. As a result, some of clusters provde poor correspondence between proten sequences and ther structures. SVM can handle the nonlnear classfcaton by mplctly mappng nput samples from the nput feature space nto another hgh dmensonal feature space wth the nonlnear kernel functon. Therefore, SVM may be more effectve to reveal the nonlnear sequence-to-structure relatonshp than K-means clusterng does. The superor performance for non-lnear classfcaton nspres us to explore the relatonshp between the proten sequence and ts structure wth SVM. Tranng SVM over the whole feature space contanng almost half mllon data samples takes a long tme. Furthermore, each subspace of the whole feature space corresponds to dfferent local 3D structures n our applcaton. As a result, constructon of one SVM for the whole feature space cannot take advantage of the strong generalzaton power of SVM effcently. The dsadvantage of buldng one SVM over the whole feature space motvates us to consder the theory of granular computng.
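The advantage of a nonlinear kernel over a purely distance-based view can be seen on toy data. The following sketch (using scikit-learn and synthetic concentric circles, not the dissertation's protein dataset) shows an RBF-kernel SVM succeeding where a linear decision surface fails:

```python
# Toy illustration (not the dissertation's protein data): an RBF-kernel SVM
# separates concentric circles, where no linear hyperplane in the input
# space can; the kernel implicitly maps the points to a space where a
# separating hyperplane exists.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf", gamma=2.0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"linear: {linear_acc:.2f}  rbf: {rbf_acc:.2f}")
```

The gap between the two accuracies is the same effect the text appeals to: a distance computed in the raw input space cannot capture a relationship that only becomes linear after the kernel mapping.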

Granular computing decomposes information into aggregates such as subsets, classes, and clusters of a universe and then solves the targeted problem in each granule (Yao, 2004). Granule construction and computing are the two major tasks of granular computing (Yao, 2005). Granular computing conceptualizes the whole feature space at different granularities and switches among these granularities (Yao, 2004). With the principle of divide-and-conquer, granular computing breaks complex problems into smaller and computationally simpler problems and focuses on each small problem by omitting unnecessary and irrelevant information. As a result, granular computing can increase the intelligence and flexibility of data mining algorithms. To combine the theory of granular computing with the principles of statistical learning algorithms, we propose a new computational model called Clustering Support Vector Machines (CSVMs). In this new computational model, one SVM is built for each information granule defined by the sequence clusters created by the clustering algorithm. CSVMs are modeled to learn the nonlinear relationship between protein sequences and their structures in each cluster. Although an SVM is not favorable for large numbers of data samples, CSVMs can easily be parallelized to speed up the modeling process. After gaining knowledge about the sequence-to-structure relationship, CSVMs are used to predict distance matrices, torsion angles, and secondary structures for the backbone α-carbon atoms of protein sequence segments. Compared with the clustering system introduced previously, CSVMs can estimate how closely frequency profiles of protein sequences correspond with local 3D structures by using the nonlinear kernel. The introduction of CSVMs can potentially improve the accuracy of local protein structure prediction.
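The CSVM construction described above, one SVM per K-means granule, can be sketched as follows. This is a minimal illustration with synthetic data and scikit-learn; the data, labels, cluster count, and kernel settings are placeholder assumptions, not the dissertation's actual pipeline:

```python
# Minimal CSVM sketch: partition the feature space with K-means, then train
# one SVM per information granule (cluster). The synthetic data and labels
# stand in for frequency profiles and reliable/unreliable structure labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))            # stand-in for frequency profiles
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # stand-in binary labels

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# One SVM per granule, trained only on that granule's samples.
csvms = {}
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    if len(set(y[mask])) < 2:
        # Degenerate granule with a single class: fall back to that label.
        csvms[c] = int(np.bincount(y[mask]).argmax())
    else:
        csvms[c] = SVC(kernel="rbf").fit(X[mask], y[mask])

def csvm_predict(x):
    """Route a sample to its nearest centroid, then ask that granule's SVM."""
    c = int(kmeans.predict(x.reshape(1, -1))[0])
    model = csvms[c]
    if isinstance(model, int):
        return model
    return int(model.predict(x.reshape(1, -1))[0])

preds = [csvm_predict(x) for x in X[:50]]
```

Because each SVM sees only its own granule's samples, the per-cluster training jobs are independent, which is what makes the modeling process trivially parallelizable.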

CSVMs are built from information granules, which are intelligently partitioned by clustering algorithms. Intelligent partitioning by clustering algorithms provides a true and natural representation of the inherent data distribution of the system. Because of data partitioning, a complex classification problem is converted into multiple smaller problems, so that the learning tasks for each CSVM are more specific and efficient (He et al., 2006). Each CSVM can concentrate on the highly related samples in its feature subspace without being distracted by noisy data from other clusters. As a result, CSVMs can potentially improve the generalization capability for classification problems. Since granulation by K-means clustering may introduce noise and irrelevant information into each granule, machine learning techniques are required to identify the strength of correspondence between frequency profiles and the 3D local structure for each sequence segment belonging to the same information granule. After learning the relationship between the frequency profile distribution and 3D local structures, CSVMs can filter out potentially unreliable predictions and select potentially reliable predictions for each granule. Because our unpublished results reveal that the distribution patterns of frequency profiles in each cluster are quite different, the functionality and training of the CSVMs are customized for each cluster belonging to the different cluster groups. The CSVMs for clusters belonging to the bad cluster group are designed to identify sequence segments whose structure can be reliably predicted, while the CSVMs for clusters belonging to the good cluster group are trained to filter out sequence segments whose structure cannot be reliably predicted.

Local protein structure prediction by CSVMs is based on the prediction method of the clustering algorithm. First, a sequence segment whose structure is to be predicted is assigned to a specific cluster in the cluster group by the clustering algorithm. Then the CSVM trained for this specific cluster is used to identify how closely the frequency profile of the sequence segment is nonlinearly correlated with the 3D local structure of the cluster. If the sequence segment is predicted as a positive sample by the CSVM, the frequency profile of the segment has the potential to be closely mapped to the 3D local structure of the cluster; consequently, the 3D local structure of the cluster can be safely assigned to the sequence segment. The method used to decide the 3D local structure of each cluster can be found in Chapter 12. If the sequence segment is predicted as a negative sample by the CSVM, the frequency profile of the segment does not closely correspond to the 3D local structure of the cluster, and the structure of the segment cannot be reliably predicted by that cluster. The cluster is then removed from the cluster group, and the cluster membership function based on distance scores and reliability scores is used to select the next cluster from the remaining clusters of the cluster group. This procedure is repeated until the SVM modeled for the selected cluster predicts the given sequence segment as positive. The knowledge about the correspondence between frequency profiles and the 3D local structure provided by CSVMs thus supplies an additional dependable metric for cluster membership assignment.

The average accuracy of the CSVMs is over 80%, which indicates that the generalization power of the CSVMs is strong enough to recognize the complicated patterns of sequence-to-structure relationships. The CSVMs modeled for the different cluster groups obtain a good capability to discriminate between positive and negative samples. The CSVMs for the bad cluster group are able to select frequency profiles of sequence segments whose structure can be reliably predicted. The recall value for the CSVMs belonging to the good cluster group reaches 96%; this high value reveals that the CSVMs did not misclassify frequency profiles of sequence segments whose structure can be accurately predicted.
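The iterative prediction procedure just described can be sketched as a short loop. The helper names (`select_next_cluster`, `csvms`, `structures`) are hypothetical stand-ins for the cluster membership function and the trained per-cluster models, not the dissertation's actual implementation:

```python
# Sketch of the CSVM prediction loop: try clusters in membership order
# (distance + reliability scores) until one cluster's CSVM accepts the
# segment as a positive sample. All helper objects are hypothetical.

def predict_structure(segment_profile, candidate_clusters, csvms, structures,
                      select_next_cluster):
    """Return the representative 3D local structure of the first cluster
    whose CSVM labels the segment positive, or None if every cluster
    rejects it."""
    remaining = list(candidate_clusters)
    while remaining:
        # Membership function picks the best remaining cluster.
        c = select_next_cluster(segment_profile, remaining)
        if csvms[c](segment_profile) == 1:
            return structures[c]   # positive: assign this cluster's structure
        remaining.remove(c)        # negative: discard cluster, try the next
    return None
```

A quick check with mock objects: if cluster 0's CSVM always rejects and cluster 1's always accepts, the loop skips cluster 0 and returns cluster 1's representative structure.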
The precision value for the CSVMs belonging to the good cluster group reaches 86%. This high precision value demonstrates that the CSVMs belonging to the good cluster group obtain the capability to filter out frequency profiles of sequence segments whose structure cannot be reliably predicted. Compared with the clustering system introduced previously, our experimental results show that the accuracy of local structure prediction is improved noticeably when CSVMs are applied.

1.2 Dissertation Organization

This dissertation is divided into four parts. In the first part of the dissertation, I discuss how protein structures are represented and why protein structure prediction is important; the first part covers Chapter 2. In the second part, I discuss the new improved K-means clustering algorithm for sequence cluster and motif discovery, and then explain how sequence variation for sequence clusters may influence their structural similarity. Based on this information, the clustering system is developed in order to carry out local protein structure prediction. The second part spans Chapter 3 to Chapter 8. The third part of the dissertation discusses the new clustering support vector machine for local protein structure prediction, since the clustering system used in the second part may not capture the nonlinear sequence-to-structure relationship effectively; the third part covers Chapter 9. The fourth part provides the conclusions and future work and covers Chapter 10.

In Chapter 2, the four levels of protein structure are explained first. Then, how protein structure can be experimentally determined is introduced. In the third part of the chapter, three major computational methods for predicting protein structure are discussed in detail.

In Chapter 3, an improved K-means clustering algorithm is introduced in order to explore recurring sequence motifs of proteins. Information about local protein sequence motifs is very important to the analysis of biologically significant conserved regions of protein sequences. The chapter is divided into five sections. First, the major motif discovery methods are discussed. Then, the major characteristics of the traditional and improved K-means algorithms are compared. In Section 3.3, the experimental setup is explained. In Section 3.4, experimental results are presented to show that the improved K-means algorithm is better than the traditional K-means algorithm and to give evidence that our research finds some previously undiscovered sequence motifs. In Section 3.5, our research is compared with other state-of-the-art approaches in order to emphasize the advantages of our research.

The long search time for initial points motivates us to implement a parallel K-means algorithm in order to reduce the search time for suitable initial points to one to two days. In Chapter 4, the parallel K-means algorithm is introduced. Parallelization of the improved K-means algorithm will make exploration of the optimal cluster number possible, and we predict that the performance gains of the improved K-means algorithm will increase further after the optimal cluster number is found. In this chapter, two important parallelization techniques for the K-means clustering algorithm are discussed. Then the programming environment and implementation details are explained. Finally, experimental results for speedup values are presented.

In Chapter 5, we discuss how sequence variation for sequence clusters may influence their structural similarity, one of the most important questions in current bioinformatics research. In this chapter, previous studies of the sequence and structural variation of sequence clusters are reviewed first. Then recurrent clustering, the data set, and the generation of sequence segments are introduced. The evaluation of sequence variation and structural similarity is discussed in detail. Finally, the results of the analysis of the relationship between sequence variation and structural variation are given.

In Chapters 3 and 5, we discuss the improved K-means algorithm for sequence motif discovery and how sequence variation for sequence clusters may influence their structural similarity. Based on this knowledge, the clustering system is developed for local protein structure prediction in Chapter 6. In this chapter, how to cluster sequence segments into clusters is explained first. Then the method for calculating the representative structure of each cluster is explained. The distance score and reliability score used to decide cluster membership are discussed. The performance evaluation and experimental results are explained in the last part of the chapter.

In Chapter 7, Support Vector Machines are explained in detail. Support Vector Machines are a new generation of learning machines which have been successfully applied to a wide variety of application domains (Cristianini and Shawe-Taylor, 2000), including bioinformatics (Schoelkopf, Tsuda and Vert, 2000). Construction of an optimal hyperplane that separates samples belonging to the first class from samples belonging to the second class with the maximal margin is the essential task of an SVM. In this chapter, the concept of the optimal hyperplane and the optimization problems used to construct it in the linearly separable case and in the linearly nonseparable case are discussed first. Then the expected risk bounds are evaluated to assess the effectiveness of support vector machines. In addition, the quadratic optimization and linear optimization methods used to build SVMs are discussed. Because SVM kernels play a key role in implicitly calculating the inner products between support vectors and other vectors in the high dimensional feature space, several important SVM kernels are introduced in this chapter. In real


More information

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions Applcaton of Maxmum Entropy Markov Models on the Proten Secondary Structure Predctons Yohan Km Department of Chemstry and Bochemstry Unversty of Calforna, San Dego La Jolla, CA 92093 ykm@ucsd.edu Abstract

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Support Vector Machines. CS534 - Machine Learning

Support Vector Machines. CS534 - Machine Learning Support Vector Machnes CS534 - Machne Learnng Perceptron Revsted: Lnear Separators Bnar classfcaton can be veed as the task of separatng classes n feature space: b > 0 b 0 b < 0 f() sgn( b) Lnear Separators

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Supervsed vs. Unsupervsed Learnng Up to now we consdered supervsed learnng scenaro, where we are gven 1. samples 1,, n 2. class labels for all samples 1,, n Ths s also

More information

CLASSIFICATION OF ULTRASONIC SIGNALS

CLASSIFICATION OF ULTRASONIC SIGNALS The 8 th Internatonal Conference of the Slovenan Socety for Non-Destructve Testng»Applcaton of Contemporary Non-Destructve Testng n Engneerng«September -3, 5, Portorož, Slovena, pp. 7-33 CLASSIFICATION

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

Network Intrusion Detection Based on PSO-SVM

Network Intrusion Detection Based on PSO-SVM TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Discriminative classifiers for object classification. Last time

Discriminative classifiers for object classification. Last time Dscrmnatve classfers for object classfcaton Thursday, Nov 12 Krsten Grauman UT Austn Last tme Supervsed classfcaton Loss and rsk, kbayes rule Skn color detecton example Sldng ndo detecton Classfers, boostng

More information

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems Determnng Fuzzy Sets for Quanttatve Attrbutes n Data Mnng Problems ATTILA GYENESEI Turku Centre for Computer Scence (TUCS) Unversty of Turku, Department of Computer Scence Lemmnkäsenkatu 4A, FIN-5 Turku

More information

Disulfide Bonding Pattern Prediction Using Support Vector Machine with Parameters Tuned by Multiple Trajectory Search

Disulfide Bonding Pattern Prediction Using Support Vector Machine with Parameters Tuned by Multiple Trajectory Search Proceedngs of the 9th WSEAS Internatonal Conference on APPLIED IFORMAICS AD COMMUICAIOS (AIC '9) Dsulfde Bondng Pattern Predcton Usng Support Vector Machne wth Parameters uned by Multple rajectory Search

More information

BIN XIA et al: AN IMPROVED K-MEANS ALGORITHM BASED ON CLOUD PLATFORM FOR DATA MINING

BIN XIA et al: AN IMPROVED K-MEANS ALGORITHM BASED ON CLOUD PLATFORM FOR DATA MINING An Improved K-means Algorthm based on Cloud Platform for Data Mnng Bn Xa *, Yan Lu 2. School of nformaton and management scence, Henan Agrcultural Unversty, Zhengzhou, Henan 450002, P.R. Chna 2. College

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Associative Based Classification Algorithm For Diabetes Disease Prediction

Associative Based Classification Algorithm For Diabetes Disease Prediction Internatonal Journal of Engneerng Trends and Technology (IJETT) Volume-41 Number-3 - November 016 Assocatve Based Classfcaton Algorthm For Dabetes Dsease Predcton 1 N. Gnana Deepka, Y.surekha, 3 G.Laltha

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2860-2866 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A selectve ensemble classfcaton method on mcroarray

More information