
Karl Gutwin
May 15, 2005
18.336

A Parallel Implementation of a Higher-order Self-Consistent Mean Field

Effectively solving the protein repacking problem is a key step toward successful protein design. Put simply, when designing a protein, one starts with the intended backbone structure, onto which a sequence of residues is to be placed, with the ultimate goal that the sequence folds natively to the desired structure. A successful design packs the chosen sequence so that a minimum of unfilled space is left within the core of the protein and no significant spatial clashes occur between atoms.

An early method for solving this problem, and one still widely employed, is rotamer packing. A rotamer is the collection of bond angles that uniquely defines a particular conformation of a residue. Rotamers are usually gathered into sets called libraries, whose members are used computationally to discretize the otherwise infinite, many-dimensional conformational search space. Rotamer libraries are often generated through statistical analysis of known molecular structures, and can describe a significant fraction of the accessible conformational space of a given residue.

At its core, the rotamer packing method must choose the best rotamer out of a library for each position in a protein chain. Here, the best rotamer is the one which, out of all possible rotamers, produces the fewest clashes with other residues and best stabilizes the overall structure. Unfortunately, this problem is hard to solve simultaneously over all positions: even a moderately sized protein of 250 residues, with approximately ten rotamers per residue, has on the order of 10^250 possible combinations. Therefore, several algorithms are

available that reduce the runtime significantly. One of these algorithms, on which I will focus for the remainder of this paper, is called the self-consistent mean field (SCMF). The first paper describing this method was published by Koehl and Delarue in 1994 [1]. In it, they describe their application of mean field theory. To be succinct (a detailed explanation of the theory is given elsewhere [2]), the theory states that the effective potential of a given rotamer is the sum, over all rotamers of the other residues, of the probability that each of those rotamers exists in the final structure multiplied by its true pairwise potential. These probabilities can in turn be calculated from the effective potentials using the Boltzmann distribution. The two relationships form the basis of an iterative process: effective potentials are calculated for each residue, and then the probabilities (stored in what the authors term the conformational matrix) are recalculated from the effective potentials. The process converges in 20 or fewer rounds, at which point the best rotamer at each residue position can be selected simply by choosing the one with the highest probability. The advantage of this procedure is that it operates in linear time with respect to the number of residues.

[1] Koehl, P. and M. Delarue. J Mol Biol 239, 249-275 (1994).
[2] Koehl, P. and M. Delarue. Curr Opin Struct Biol 6, 222-226 (1996).

Unfortunately, the method works under the assumption that residues which exhibit statistical dependence on one another will nevertheless arrive at the right rotamers through this iterative process, and this assumption has not been shown to be true. To put forth some evidence toward resolving this question, and perhaps to create a more accurate version of the method, I added explicit dependence to the SCMF. In the original SCMF definition, each group of rotamers was a single residue. In my new definition, each group is composed of the rotamers from several

residues. Therefore, the number of rotamers in a group is now the product of the numbers of rotamers of its comprising residues, and each new rotamer set contains one rotamer from each comprising residue. Otherwise, everything is computed identically. This has the benefit that when the number of residues in each group is set equal to one, the algorithm computes the original SCMF, which will be useful in the final comparisons.

One major issue with this method is that as the size of each group grows (what I term the dimensionality of the problem), the number of rotamer sets grows exponentially. Table 1 shows this growth for a small protein.

D    Groups    Sets/group    Total sets
1       100            10         1,000
2        50           100         5,000
3        34         1,000        34,000
4        25        10,000       250,000

Table 1. Exponential growth of the higher-order SCMF. The example given is a 100-residue protein with an average of 10 rotamers per residue.

As one of the steps to mitigate the computational impact of this growth, I implemented this solution using parallel computing. My implementation uses straight C/MPI with a minimum of outside libraries (only libm).

In preparing the parallel implementation, some hurdles had to be overcome. The first involved memory. The original SCMF precalculated all the per-residue true potentials, since these do not change over the course of the calculation. However, it quickly became apparent that the size of this calculation would exceed the memory capacity of our cluster if the dimensionality went much higher than two or three. To mitigate this, I elected to recalculate all true potentials at each iteration. This had the advantage of allowing larger calculations to continue, but at the cost of a significantly longer runtime. The second hurdle involved load balancing. The basic unit of the computation is the rotamer set. However, different residue groups have different

numbers of rotamer sets, and the differences can become very significant as the dimensionality grows larger. To mitigate this, I had to divide the residue groups among the processors such that the number of sets per processor was roughly equal. This did not help when the dimensionality grew large, however, since a single group could contain more sets than could be offset by distributing the remaining groups among the other processors. This created a bottleneck in which adding more processors did not decrease the runtime.

D    RMSD
1    18398.3
2    18392.3
3    18323.7
4    18318.4

Table 2. RMSD comparisons between the template structure and the best SCMF-calculated structure. Tests were done using PDB ID 1NLM.

In order to evaluate the results of my code, a number of different metrics could be used. Unfortunately, I only had sufficient time to test, in a somewhat limited fashion, one simple but effective metric. RMSD (root-mean-square deviation) is a common metric in the protein structure field, since it quickly provides a single number that can be easily compared between structures of the same molecule. Table 2 reports the results of calculating the RMSD between the template structure and the best SCMF-calculated structure at various dimensionalities. It is clear that the RMSD does decrease as the dimensionality increases; however, it is interesting to note that the RMSD between the various recalculated structures is much higher than the difference between their RMSDs against the template structure. This suggests that increasing the dimensionality does not uniformly increase the number of best choices; in other words, the changes from D=2 to D=3 are not simply a few better choices but a collection of different choices which, taken together, total a better score.

Any report on such a parallel process would be incomplete without some mention of runtime. It is interesting to note that I anticipated nearly a p-fold speedup given the

nature of the calculation, which is quite simply parallelized. However, the bottlenecking issue did come into play quite strongly, especially at the higher dimensionalities. Figure 1 shows these results. At D=1, the curve is nearly linear, reflecting p-fold speedup. At higher dimensionalities, however, the curves become bent, suggesting some alternate effects.

[Figure 1. Run time (secs) vs. NP (4, 8, 16) for D = 1, 2, and 3. Runtime analysis performed on PDB ID 1NLM.]

The bottlenecking effect is unpredictable and appears to have been shown, in this instance, at NP=8 for D=3 as well as NP=16 for D=2. One test run was done at D=4, for NP=16 only, which took 7820 seconds. This test was not repeated at lower processor counts, and no tests were performed at higher dimensionalities, due to time constraints.

There are still many aspects of this project which are left open for improvement. One clear item is that during the course of this algorithm, many rotamer sets are generated whose conformational matrix values are ultimately so low that they could never be viable final candidates. Culling these sets would dramatically decrease processing and would easily allow the maximum possible dimensionality to double. The program is structured such that removing these sets would not be too difficult; however, no time was available to implement such a step. The bottlenecking issue could also be resolved by breaking up residue groups, although culling those sets which are very large would probably have a much greater impact. Finally, it should be mentioned that the current version of the program creates groups by sequential addition; this is not the best strategy structurally speaking, since residues often interact over large sequential

distances. A clustering algorithm could identify those clusters of residues which would benefit most from being grouped together.

Overall, this project has shown success in implementing a parallel, higher-order self-consistent mean field. Preliminary results suggest that such an implementation produces better results overall than a similar first-order method. Further improvements are possible which could allow for a further increase in dimensionality as well as more accurate clustering. While it is unlikely that this method will ultimately gain widespread use (other, more advanced algorithms perform well on single-processor workstations), it may prove to be the most accurate method available.