CHARACTERIZING THE SPACE OF INTERATOMIC DISTANCE DISTRIBUTION FUNCTIONS CONSISTENT WITH SOLUTION SCATTERING DATA


Paritosh A. Kavathekar
Department of Computer Science, Dartmouth College, Hanover, NH 03755

Bruce A. Craig
Department of Statistics, Purdue University, West Lafayette, Indiana 47907, USA

Alan M. Friedman
Department of Biological Sciences, Purdue Cancer Center and Bindley Bioscience Center, Purdue University, West Lafayette, Indiana 47907, USA

Chris Bailey-Kellogg and Devin J. Balkcom
Department of Computer Science, Dartmouth College, Hanover, NH 03755
Email: devin@cs.dartmouth.edu

Abstract: Scattering of neutrons and x-rays from molecules in solution offers alternative approaches to studying a wide range of macromolecular structures in their solution state without the need for crystallization. In this paper, we study one part of the problem of elucidating three-dimensional structure from solution scattering data: determining the distribution of interatomic distances, P(r). This problem is known to be ill-conditioned; for a single observed diffraction pattern, there may be many consistent distance distribution functions. Due to the ill-conditioning, there is a risk of overfitting the observed scattering data. We propose a new approach to avoiding this problem, accepting the validity of multiple alternative P(r) curves rather than seeking a single best one. We show that there are linear constraints that ensure that a computed P(r) is consistent with the experimental data. The constraints enforce smoothness in the P(r) curve, ensure that the P(r) curve is a probability distribution, and allow for experimental error. We use these constraints to precisely describe the space of all consistent P(r) curves as a polytope of histogram values or Fourier coefficients. This description can then be used to sample the space of potential alternative P(r) curves. We use this description to develop a linear programming approach to sampling the space of consistent, realistic P(r) curves. In tests on both experimental and simulated scattering data, our approach efficiently generates ensembles of such curves that display substantial diversity.
In particular, we show that the ensemble of P(r) curves generated for a given protein includes members that are more different from a reference P(r) curve for that protein than are reference curves for proteins of other structural topologies. Thus subsequent reconstruction steps must properly account for this diversity in optimizing structural models.

1 INTRODUCTION

There is currently no single best experimental technique for studying the structure of macromolecules. Crystallizing proteins for x-ray crystallography is often difficult, while NMR is limited by the size of the protein or complex. In this paper, we examine solution scattering, often called small angle x-ray scattering (SAXS) when x-rays are employed and only data to small diffraction angles are collected. Solution scattering is a relatively simple and inexpensive experimental technique that can be applied to a large range of molecular sizes, from 1-1 Å, without the need for crystallization [1]. SAXS has found widespread applications for diverse problems such as low resolution structure prediction of proteins and complexes [2-5], protein folding studies with time-resolved data [6, 7], protein dynamics [8], and determining the association model for protein complexes [9].

* Corresponding author.

In a SAXS experiment, a narrow beam of x-rays is directed towards a dilute solution of macromolecules. The electrons in the macromolecule scatter the beam, and a detector measures the intensity of the scattered beams in different directions. The resultant intensity at any point depends on the relative position of the electrons with respect to each other, yielding a curve I(q) that expresses the scattered intensity (I) at different values of the momentum transfer vector (q). The I(q) curve is a function of the protein shape, and although spherical averaging and experimental noise cause a loss of information in going from a three-dimensional structure to a one-dimensional scattering pattern, it has proven practically possible to use SAXS to reconstruct the low-resolution structure of a macromolecule [2, 10, 3].

We study an intermediate problem in the reconstruction process: computing a function P(r) that describes the distribution (P) of interatomic distances (r) in a molecule. The P(r) is a more intuitive function of the molecular shape than the I(q) curve, and is a useful intermediate in the reconstruction. For example, the real-space version of the program GASBOR [11, 12] produces models by matching a given P(r) curve. The P(r) curve is related to the I(q) curve by the following Fourier transform [13]:

$$P(r) = \frac{2r}{\pi} \int_0^{\infty} q \, I(q) \sin(qr) \, dq, \qquad (1)$$

where $q = 4\pi \sin\theta / \lambda$, with $2\theta$ the scattering angle and $\lambda$ the x-ray wavelength. Figure 1 summarizes the relationship among structure, P(r), and I(q).

Although the P(r) curve is related to the scattering curve by the well-established integral relation of equation 1, obtaining the P(r) curve for a protein given its scattering profile is not trivial. This problem is ill-conditioned [14] because experimentally the data are available only for a finite q range, whereas the integral in equation 1 extends between 0 and infinity. Applying the direct transform to the limited data produces non-physical P(r) curves. The ill-conditioned nature of this problem creates significant potential for overfitting the data.

All previous approaches try to produce a single best P(r) curve and avoid overfitting by various mechanisms. Most previous approaches try to produce P(r) curves that reduce the discrepancy, $\chi^2$, between the experimental scattering curve and the one predicted using the Fourier inverse of equation 1. Additional constraints are employed to reduce the potential for overfitting. Early approaches modeled P(r) curves as the summation of continuous basis functions. For example, Moore [15] used sine functions over a restricted interval and employed a Shannon information content criterion to avoid overfitting by limiting the number of sine functions. Glatter [16] used B-splines to model P(r) curves and avoided overfitting by adding a damping term to the $\chi^2$ target function in order to obtain a smooth solution. Steenstrup and Hansen [17, 18] modeled P(r) as a discrete distribution function defined at a fixed number of points. They avoided overfitting by maximizing the entropy of the distribution function subject to a constraint tying $\chi^2$ to 1, expressed as a Lagrange multiplier. One of the most widely used programs for constructing P(r) curves is GNOM [14, 19], which uses Tikhonov regularization [20, 21] in order to avoid overfitting. GNOM defines a set of perceptual criteria that the user desires in the P(r) and sets the regularization and smoothing parameters so as to best achieve those properties.

Our alternative approach to avoiding overfitting is not to seek a single best solution, but rather to accept the validity of multiple feasible solutions consistent with a given I(q) curve. Key to determining such an ensemble is a representation for P(r) curves that are not just consistent with a scattering profile, but also have properties such as continuity and smoothness that make them protein-like.

The main contributions of this paper are as follows:
(1) We provide a complete characterization, as a convex polytope in an appropriate representation space, of those P(r) curves that are consistent with a given scattering curve and display realistic properties.
(2) We provide a linear-programming based method to quickly generate consistent, realistic P(r) curves for a given scattering curve.

(3) In tests with both experimental and simulated data, we demonstrate that consistent and realistic P(r) curves are significantly diverse, such that ensembles for proteins with different folds can overlap, limiting the direct identification of protein fold from scattering data.

Fig. 1. A schematic diagram showing the correspondence between protein shapes, their P(r) curves, and scattering curves I(q). The labels on the arrows (direct calculation, indirect transform, direct transform, solution experiment) show the techniques used to get one representation from the other. In practice, different shapes might have close P(r) curves, and different I(q) curves may map to similar P(r) curves.

2 METHODS

Equation 1 describes the relationship between the scattering curve and the interatomic distance distribution function P(r). Our aims are to characterize the space of all physical P(r) curves that are consistent with a given I(q) curve, and to quickly generate an ensemble of such curves. We first give an overview of the approach, before providing details in the subsections.

One common representation of a P(r) curve is as a histogram of bins centered at discrete r values. A P(r) curve is thus represented by a vector of length n, where n is the number of bins. Conversely, every point in an n-dimensional space corresponds to a P(r) curve. We call this space the P(r) space. We can characterize points in P(r) space as consistent (they represent P(r) curves that satisfy the given I(q) curve within a predefined error tolerance), realistic (they represent P(r) curves that are smooth and protein-like), both, or neither. In Section 2.1 we mathematically define consistent and realistic P(r) curves. Under that definition, we show that all such curves lie inside a convex polytope in P(r) space. Mathematically, we can describe the set of all solutions with an equation

$$C \, P(r) \le b, \qquad (2)$$

where C is a matrix that defines this polytope, P(r) a vector representing the histogram, and b a vector of constants representing the constraints.

Another common representation of a P(r) curve is as a continuous curve, a linear combination of basis functions. Practically, the number of basis functions is finite, say k.
Thus a P(r) curve is represented as a set of k coefficients, defining the linear combination. As before, points in the coefficient space correspond to P(r) curves. In Section 2.2, we extend the results for the histogram approach to coefficient space; if $\alpha$ represents a point in coefficient space, then all consistent and realistic P(r) curves lie inside a convex polytope in coefficient space given by

$$C' \alpha \le b. \qquad (3)$$

The geometric characterization of these spaces, in which all consistent curves lie in a contiguous, well-defined region, has advantages both for understanding the properties of the P(r) curves and for producing ensembles using linear combinations. This leads to our algorithms in Section 2.3 for generating consistent and realistic P(r) curves.

2.1 Consistent and realistic P(r) curves in the histogram representation

Both P(r) and I(q) are continuous functions. Experimentally, the scattering intensities are measured at some discrete set of q values. Similarly, P(r) curves are also represented as histograms with a known bin width. Our method treats both P(r) and I(q) as continuous curves sampled at discrete points. Although we use discrete approximations of continuous curves, we do not place any condition on the number, width, or uniformity of bins. Under discrete approximation we can write the Fourier inverse of equation 1 as follows:

$$I(q) = \sum_{i=1}^{n} h_i \, \frac{\sin(q r_i)}{q r_i} \, P(r_i), \qquad (4)$$

where $h_i$ is the width of the $i$th bin. In the interest of clarity we will ignore the bin-width parameter $h_i$ in some of the equations; this omission does not affect our framework. We scale I(q) curves by dividing by I(0), the zero-angle scattering intensity. Experimentally, I(0) values are not available, and are estimated using the Guinier plot [22, 13]. For a set of discrete q values, we can write equation 4 in matrix form:

$$\begin{bmatrix} I(q_1) \\ \vdots \\ I(q_m) \end{bmatrix} = \begin{bmatrix} \dfrac{\sin(q_1 r_1)}{q_1 r_1} & \cdots & \dfrac{\sin(q_1 r_n)}{q_1 r_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\sin(q_m r_1)}{q_m r_1} & \cdots & \dfrac{\sin(q_m r_n)}{q_m r_n} \end{bmatrix} \begin{bmatrix} P(r_1) \\ \vdots \\ P(r_n) \end{bmatrix}, \qquad (5)$$

$$I(q) = A \, P(r), \qquad (6)$$

where A is the transform matrix. We use P(r) to represent both the interatomic-distance distribution function and the vector defined in equation 5; the correct interpretation should be clear from the context. $P(r_i)$ refers to the P(r) value at the $i$th sampled point, $r_i$. We assume that there are n such points and m sampled points in the q space.

We now mathematically define consistent and realistic P(r) curves in terms of linear constraints on the histograms.

Normalization. The P(r) distribution is a probability distribution, so it must sum to 1. (This is a normalizing constraint; P(r) scales are relative. This normalization implies that we can treat P(r) curves as probability distributions.) Mathematically,

$$\sum_{i} P(r_i) = 1.$$

Scattering intensity. The P(r) curve must give rise to the observed scattering curve. Ideally the P(r) curve should satisfy equation 5, but practically, because of experimental limitations and noise, we can only constrain the predicted I(q) to a certain interval:

$$I(q) - \sigma(q) < A \, P(r) < I(q) + \sigma(q).$$

Here $\sigma(q)$ is a column vector that specifies the allowed error at each q value. In our implementation we use $\sigma$ values that depend on the standard deviation of the measured intensities.

Non-negativity. Since the $P(r_i)$ are probability values, they must all be non-negative:

$$0 \le P(r_i) \le 1, \quad \forall i.$$

While these constraints are loose, for some (or all) r values we can enforce stricter constraints that force the probability values to lie in a particular interval. For example, for r close to 0 or to $D_{max}$, the maximum distance, which can be obtained from the I(q) curve, we can limit the probability values to the range $[0, \epsilon]$, where $\epsilon$ is suitably close to 0.

Continuity. One way to ensure that P(r) curves are smooth is to restrict the amount of variation in the probability values between adjacent bins. Though such a constraint does not eliminate local maxima or minima, it ensures that these extremal points are not sharp. Let $\beta$ be the maximum permissible difference between adjacent probability values. We can write the continuity constraint as follows:

$$|P(r_{i+1}) - P(r_i)| \le \beta, \quad \forall i < n.$$

Criteria that influence the selection of $\beta$ include the number of bins, the width of the bins, and the level of smoothness desired. In practice, a uniform $\beta$ works well, but $\beta$ need not be constant over the entire P(r) curve. In some regions, such as near r = 0, sharp spikes reflecting atomic packing (with high resolution data) may be acceptable, while in others they may not be desirable.

Smoothness. Continuity constraints restrict abrupt changes in P(r) values, but portions of the curve can still have a saw-tooth pattern without violating continuity. This pattern is characterized by alternating small local maxima and minima. We address this problem by enforcing second-order constraints on consecutive triples of the P(r) curve:

$$|P(r_{i-1}) - 2P(r_i) + P(r_{i+1})| \le \gamma, \quad \forall i \in [2, n-1].$$

These conditions bound the second derivative at each point in the discrete approximation of the P(r) curve. Here, $\gamma$ is a user-defined parameter that need not be constant over the entire curve; similar to the continuity constraints, we can have different curvature bounds for different portions of the curve.

All constraints used to describe consistent P(r) curves are linear, so we can combine them to produce equation 2, with C as a matrix containing the constraints and b a vector containing the corresponding constants. The normalization constraint is an equality constraint, but may be written as an equivalent pair of inequality constraints. Equation 2 represents a convex polytope in P(r) space that characterizes the space of desired solutions; all consistent P(r) curves lie inside this high-dimensional convex polytope, and conversely any point inside this polytope corresponds to a consistent and realistic P(r) curve.

2.2 Consistent and realistic P(r) curves in the basis function representation

Let us express P(r) curves in a functional basis such as a Fourier basis [15]. Then for any value $r_j$ we can write

$$P(r_j) = \sum_{i=1}^{k} \alpha_i f_i(r_j),$$

where $f_i$ are the basis functions, $\alpha_i$ the corresponding coefficients, and k the number of basis functions suitable to represent all P(r) curves to the required resolution. Let $1_k$ be a row vector of all ones, with length k. Similarly, define $0_k$ as a row vector of length k with all zeros. Using $1_k$ and $0_k$ we define two matrices, M and D, as follows:

$$M = \begin{bmatrix} 1_k & 0_k & \cdots & 0_k \\ 0_k & 1_k & \cdots & 0_k \\ \vdots & \vdots & \ddots & \vdots \\ 0_k & 0_k & \cdots & 1_k \end{bmatrix}, \qquad D = \begin{bmatrix} D_1 \\ \vdots \\ D_n \end{bmatrix}, \quad D_j = \mathrm{diag}\big(f_1(r_j), f_2(r_j), \ldots, f_k(r_j)\big),$$

where M is an $n \times kn$ block-diagonal matrix and D is a $kn \times k$ matrix holding the basis functions, so that row j of the product MD is $[f_1(r_j), f_2(r_j), \ldots, f_k(r_j)]$. If $\alpha$ is a vector of coefficients for the basis functions, then we can express the P(r) vector in terms of M and D as

$$\begin{bmatrix} P(r_1) \\ P(r_2) \\ \vdots \\ P(r_n) \end{bmatrix} = M D \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_k \end{bmatrix}. \qquad (7)$$

Substituting equation 7 in equation 2 (the constraints from the previous section) gives $C M D \alpha \le b$. By defining $C' = CMD$, we get equation 3, the equation of a polytope in the coefficient space. We must note that although the constraints are applied at discrete points, the underlying P(r) curves are continuous.

2.3 Generating ensembles

Intuitively, the most diverse P(r) curves lie on vertices of our polytope. While there exist vertex enumeration algorithms (e.g., [23]), they become impractical in very high dimensions (of the order of the number of points in a P(r) curve). Furthermore, the number of vertices increases exponentially with dimensionality. Thus we instead seek a diverse subset of the vertices.
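As a rough illustration of how the linear constraints of Section 2.1 might be assembled and how diverse extreme points can be found by optimizing random linear objectives, the following Python sketch works in the histogram representation rather than in coefficient space. It is not the authors' implementation (which used a Matlab solver): the function names, the synthetic reference curve, the unit bin widths, and every parameter value are assumptions chosen only to make the example self-contained, and SciPy's `linprog` stands in for the solver.

```python
import numpy as np
from scipy.optimize import linprog


def transform_matrix(q, r):
    """Matrix A of equation 5 with unit bin widths: A[j, i] = sin(q_j r_i) / (q_j r_i)."""
    # np.sinc(x) is sin(pi x) / (pi x), so rescale the argument; it also handles q r = 0.
    return np.sinc(np.outer(q, r) / np.pi)


def build_constraints(A, I_obs, sigma, beta, gamma):
    """Stack the inequality constraints of Section 2.1 as C p <= b."""
    n = A.shape[1]
    d1 = np.diff(np.eye(n), axis=0)        # rows give P(r_{i+1}) - P(r_i)
    d2 = np.diff(np.eye(n), n=2, axis=0)   # rows give P(r_{i-1}) - 2 P(r_i) + P(r_{i+1})
    C = np.vstack([A, -A, d1, -d1, d2, -d2])
    b = np.concatenate([
        I_obs + sigma, -(I_obs - sigma),   # scattering-intensity band
        np.full(2 * (n - 1), beta),        # continuity, both signs
        np.full(2 * (n - 2), gamma),       # smoothness, both signs
    ])
    return C, b


def sample_vertices(C, b, n, count, rng):
    """Maximize c . p for random directions c; each optimum is typically a vertex."""
    vertices = []
    for _ in range(count):
        c = rng.standard_normal(n)
        res = linprog(-c, A_ub=C, b_ub=b,                 # linprog minimizes, hence -c
                      A_eq=np.ones((1, n)), b_eq=[1.0],   # normalization: sum P = 1
                      bounds=(0, None), method="highs")   # non-negativity
        if res.status == 0:
            vertices.append(res.x)
    return np.array(vertices)


def convex_mix(vertices, rng):
    """A random convex combination of sampled curves; it stays inside the polytope."""
    w = rng.dirichlet(np.ones(len(vertices)))
    return w @ vertices


# Synthetic demonstration (all values below are illustrative assumptions).
n, d_max = 40, 40.0
r = np.linspace(0.0, d_max, n)
p_true = r**2 * (d_max - r)**2             # smooth toy curve vanishing at 0 and d_max
p_true /= p_true.sum()
q = np.linspace(0.01, 0.5, 30)
A = transform_matrix(q, r)
I_obs = A @ p_true                         # noiseless simulated intensities
sigma = np.full(q.size, 0.01)
beta = 2.0 * np.abs(np.diff(p_true)).max()
gamma = 2.0 * np.abs(np.diff(p_true, n=2)).max()
C, b = build_constraints(A, I_obs, sigma, beta, gamma)
rng = np.random.default_rng(0)
verts = sample_vertices(C, b, n, count=5, rng=rng)
mix = convex_mix(verts, rng)
```

Normalization is passed as an equality and non-negativity as variable bounds instead of being folded into C, purely for convenience with this solver interface. Moving to coefficient space would, under the same assumptions, amount to replacing C by CF, where F is the n-by-k matrix of basis function values.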

Fig. 2. Alternate P(r) curves (red) generated by our method for experimental data to q max = 5 for hen egg white lysozyme, along with the P(r) curve calculated from the x-ray structure (blue) and the reconstruction by GNOM (black). (a) 1 alternate curves. (b) Some example curves.

We formulate the problem of generating the vertices of the polytope as a linear program in coefficient space. Let c be a point in the coefficient space; then we solve the following linear program:

Maximize $c \cdot \alpha$, subject to $C' \alpha \le b$.

By maximizing the dot product of c with the coefficient vector $\alpha$, subject to the constraints defined in Section 2.1, we obtain a vertex of the polytope (or a point on a facet, in the case of interior point methods) in the coefficient space. To generate a number of vertices, we simply solve the optimization problem for many random vectors c. For a reasonable parameter choice and using Fourier basis functions, our program takes about one second to generate a candidate P(r) curve on a 2.4 GHz Pentium machine using the Matlab solver. Therefore, it is easy to rapidly generate a large ensemble (and possibly select the most interesting curves from it). To some extent, these vertices can be described as maximally diverse, since the linear program picks a direction in coefficient space and tries to find the maximum possible variation in that direction without violating any of the constraints.

While the vertices capture the envelope of P(r) curves, one might also want to obtain a more complete ensemble of the satisfying curves. Since simple generate-and-test algorithms are extremely inefficient in high dimensions, we instead generate additional satisfying P(r) curves by repeatedly taking convex combinations of previously identified satisfying curves. Such curves are guaranteed to lie inside the polytope because it is convex.

3 RESULTS

We demonstrate the effectiveness and significance of our approach in characterizing the diversity in realistic P(r) curves consistent with scattering data.

3.1 Ensemble Diversity

We first applied our approach to experimental scattering data up to q max = 5 for the protein hen egg-white lysozyme, as distributed with the program GNOM [14]. Figure 2(a) shows an ensemble of 1 P(r) curves generated using our method, and Figure 2(b) shows a few examples chosen from the ensemble. For comparison, the black curve shows the output from the program GNOM [14], using default parameters, while the blue curve is the distance distribution calculated from an x-ray crystal structure. There is significant diversity in the consistent, realistic P(r) curves (see Table 1 for quantification). In addition to evaluating the global diversity, we may also evaluate the uncertainty at a given point by the height of the red band. We note that both our ensemble and the P(r) curve generated using GNOM are shifted relative to the P(r) curve from the PDB file. This is due to the contribution to the solution scattering intensity from the bulk solvent that the protein replaces and from a hard water shell around the protein [24, 25]. Since it is difficult to model these solvent interactions, in order to better evaluate our method with a ground truth, we next turn to simulation results for which these scattering contributions are not included.

Fig. 3. Ensemble of 1 alternate P(r) curves (red) generated by our method for simulated scattering data to q max = 7 for the three reference domains, along with the P(r) curve calculated from the x-ray structure (blue). Panels: (a) 1i27A, (b) 1hp8A2, (c) 1mj4A.

Fig. 4. Some sample alternate P(r) curves generated for the three reference domains. Panels: (a) 1i27A, (b) 1hp8A2, (c) 1mj4A. These curves are taken from the set of curves shown in Figure 3(a)-3(c). This figure shows that the alternate P(r) curves are smooth and protein-like.

For simulation studies we have selected three very different domains, one each of alpha, beta, and alpha plus beta proteins under the CATH classification [26, 27]. The CATH ids are 1i27A (Arc Repressor Mutant, subunit A topology), 1hp8A2 (Seminal Fluid Protein PDC-109 (Domain B) topology), and 1mj4A (ubiquitin-like (UB roll) topology). For simulated data, we use the scattering intensity of the protein in vacuum (an output from CRYSOL [24]) to q max = 7. We compare against the corresponding P(r) curve computed from the atomic coordinate file. Figures 3(a), 3(b), and 3(c) show 1-member ensembles calculated from simulated data, while Figures 4(a), 4(b), and 4(c) show selected examples. The alternate P(r) curves are smooth, validating our approach of modeling smoothness as a constraint on the P(r) curves as opposed to a criterion to be optimized.

We have tested the robustness of our method by adding random Gaussian noise to the simulated scattering curves. At each q value, we added a random noise $c \, k \, \sigma(q)$, where c is a constant, k is a random variable that follows a standard normal distribution, and $\sigma(q)$ is the standard deviation in the intensity value. With a small adjustment to the error bounds for the scattering-intensity constraint, our method robustly handled noise as we increased c. We tested for c in the range [0, 5] in increments of 1. We found that adding a noise of c standard deviations at each q value required a corresponding increase in error tolerance.
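The noise model used in this robustness test can be sketched in a few lines of Python. The sketch below only illustrates adding $c \, k \, \sigma(q)$ to a simulated curve; the toy intensity curve, the error model for sigma, and in particular the rule `widened_sigma` for enlarging the tolerance are hypothetical choices made for the example, since the text does not specify the exact adjustment that was used.

```python
import numpy as np


def add_noise(intensity, sigma, c, rng):
    """Return intensity + c * k * sigma with k drawn from a standard normal at each q."""
    k = rng.standard_normal(intensity.shape)
    return intensity + c * k * sigma


def widened_sigma(sigma, c):
    """Hypothetical tolerance rule: grow the allowed error in proportion to c."""
    return (1.0 + c) * sigma


rng = np.random.default_rng(1)
q = np.linspace(0.01, 0.5, 2000)
intensity = np.exp(-(q * 15.0) ** 2 / 3.0)   # Guinier-like toy curve, not real data
sigma = 0.02 * intensity + 1e-4              # assumed per-point standard deviation
c = 2.0
noisy = add_noise(intensity, sigma, c, rng)

z = (noisy - intensity) / (c * sigma)        # recovers the standard normal draws k
inside = np.abs(noisy - intensity) <= widened_sigma(sigma, c)
print(f"std of k: {z.std():.2f}, fraction inside widened band: {inside.mean():.2f}")
```

Because k is unbounded, no fixed multiple of sigma contains every noisy point; a proportional widening like the one assumed here only keeps most points inside the band, which is one reason the tolerance has to grow along with c.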

Fig. 5. Ensembles generated using our method (red) and by varying the regularization parameter in GNOM [14] (black), with the GNOM Default and GNOM Best curves marked. Panels: (a) 1i27A, (b) 1hp8A2, (c) 1mj4A.

Table 1. Diversity in our ensemble vs. one generated by varying the regularization parameter in GNOM [14]. d(GNOM, Polytope) is the minimum distance of a curve in GNOM to one in our ensemble, and similarly for d(Polytope, GNOM).

Protein    Average d(GNOM, Polytope)    Average d(Polytope, GNOM)    d(GNOM Default, Polytope)    d(GNOM Best, Polytope)
1i27A      298 ± 11                     513 ± 99                     292                          295
1hp8A2     227 ± 58                     28 ± 73                      225                          225
1mj4A      244 ± 62                     32 ± 61                      2732                         2732

3.2 Comparison with GNOM

Our main approach is to describe completely the set of realistic P(r) curves using linear constraints. Once the set has been described, it can be sampled to find an ensemble of consistent and realistic P(r) curves. There are two advantages to this approach: all generated samples satisfy the constraints, and we can develop algorithms to get a diverse sampling within the polytope of consistent P(r) curves.

It is possible to generate an ensemble of P(r) curves using the existing GNOM software by varying the regularization parameter [14], although this is not the typical use of the regularization parameter and generating ensembles is not the intended use of GNOM. To generate a GNOM-derived ensemble we sampled the regularization parameter uniformly in log space. Of the P(r) curves so generated, we consider only those that GNOM classified as GOOD. We also generated two additional P(r) curves using GNOM: GNOM Default, the output of GNOM when run with default parameters, and GNOM Best, the output GNOM produces when it is provided the correct OSCILL and VALCEN criteria [14] determined from the calculated P(r) curve.

Table 1 compares the diversity of the ensembles generated by our method and by GNOM. For every curve in the ensemble from GNOM we calculated the distance to the closest curve in our ensemble, and vice versa. The second and third columns in Table 1 show the average over these minimum distances, while the final two columns show the distances to the default and best curves from the curves in our ensemble. The average minimum distance of a curve from GNOM's ensemble to that from our polytope ensemble is shorter than vice versa. Thus, there are curves in our ensemble that do not have a correspondingly close curve in GNOM's ensemble. We attribute this greater diversity to the nature of our sampling. Our sampling method samples points from the vertices of the convex polytope in the coefficient space. These points, by definition, correspond to P(r) curves that are most diverse without violating the constraints in Section 2.1.

Figures 5(a)-5(c) show the ensembles generated using our approach and those generated by GNOM. The GNOM curves are closely banded when compared to the curves generated from our ensemble. This graphically validates the results of Table 1; most curves in the GNOM band have a close counterpart in our ensemble, but curves in our ensemble do not always have a close counterpart among those in GNOM. The oscillations and negative values in the GNOM curves reflect our forcing the regularization parameter to unusual values. Our method generates a diverse set without these problems because it explicitly enforces the smoothness and non-negativity constraints at each r value.

3.3 Structural implications of P(r) diversity

In order to investigate the diversity of our ensemble we compare it with a diverse set of representative protein structures. We define a set of 184 diverse structures, which we call CATHRep, by selecting the CATH representative structure for each different topology. We generated both the particle scattering curves (using CRYSOL) and interatomic-distance distribution curves (from the pdb file) for all CATHRep proteins. Then we calculated the I(q) distance between each CATHRep protein and each of our example proteins, using RFactor [28] with uniform weights. Similarly, we calculated the P(r) distance between the P(r) curves for the three examples and all CATHRep proteins, in this case measuring distance as the area between the two curves. We repeated the procedure for each member of our generated ensemble, using equation 4 to determine the corresponding I(q) curves.

Figures 6(a)-6(c) plot the log of the I(q) distance vs. the log of the P(r) distance, where the blue dots correspond to CATHRep proteins and the other markers correspond to our ensemble at different values for the intensity constraints. It is clear that the scattering curves for the alternate P(r) curves are much closer than those for proteins in CATHRep, a direct consequence of the scattering-intensity constraint. However, there are structures in CATHRep whose P(r) curves are closer to the P(r) curves of the three examples than the alternate curves our method generates (as illustrated by the points below the horizontal lines). This shows that although the alternate P(r) curves give rise to scattering curves that are almost identical to those from the actual structure, they are significantly different from the P(r) curves for the actual structures when compared to the variability seen in CATHRep.

Although all three of the examples show considerable variability in candidate P(r) curves, Table 2 shows that this variability is not uniform; 1hp8A2 has far fewer structures in CATHRep with close P(r) curves than does 1i27A. The third example, 1mj4A, falls somewhere between these two. An interesting extension of our work might be to evaluate different I(q) curves for their potential to generate P(r) curves with different diversity.

Table 2 summarizes the variability with respect to CATHRep. For every ensemble we calculated the maximum and median distance of the P(r) curves in the ensemble to the reference protein's P(r) curve. Table 2 shows the number of structures in CATHRep that have a P(r) distance to the reference structure below these thresholds. These results show that the diversity of P(r) in our ensemble is large enough to overlap with a substantial number of representative structures. We note that most structures in CATHRep have large differences in both P(r) and I(q) space because of the nature of CATHRep, which is supposed to represent a diverse set of protein structures. Therefore, for the examples we have considered, the overlap is limited to 68 out of the 184 structures in CATHRep.

Our method explicitly bounds the error at each q value for the scattering curve predicted using a P(r) curve. It is natural to ask if such constraints encourage P(r) curves that differ by the maximal possible deviation at each point. In practice, this does not appear to be the case for the ensembles our method generates. Typically the $\chi^2$ distance between the scattering curves from the ensemble and those for the reference proteins was small. When a deviation of $\sigma$ was allowed at every q value, the maximum $\chi^2$ distance over ensembles for all three structures was 357. When a deviation of $\sigma/2$ was permitted this value was 135, and for ensembles with $2\sigma$ deviations permitted it was 4978. These $\chi^2$ values fall much below the permitted deviation at each point; even if a deviation of one $\sigma$ is permitted at every point, not every point on the scattering curve varies by that amount.

4 DISCUSSION AND CONCLUSIONS

We have described the first method for characterizing an ensemble of P(r) curves that
are consistent with a given scattering profile.

Fig. 6. Similarity between reference proteins and proteins in the CATH representative database, in terms of I(q) (x-axis, log(I(q) distance)) and P(r) (y-axis, log(P(r) distance)), for (a) 1i27A, (b) 1hp8A2, and (c) 1mj4A. Ensemble members are plotted for the example proteins, permitting an error of σ/2 (green triangles), σ (red circles), or 2σ (black plus signs). A single blue dot is plotted for each CATH representative. Green, red, and black horizontal lines correspond to the median values for P(r) distances for the three sets of alternative P(r) curves.

Table 2. Number of members of CATHRep with P(r) distance below the maximum or median distance in the generated P(r) ensemble.

Domain    Max Error    #CATHRep structures
                       max    median
1i27A     σ/2          4      18
          σ            38     2
          2σ           68     39
1hp8A2    σ/2          13     7
          σ            14     7
          2σ           13     12
1mj4A     σ/2          19     16
          σ            2      19
          2σ           32     19

At the heart of our approach is the idea of representing the desirable properties of the P(r) curves as constraints on the set of solutions. Such a formulation allows us to describe all realistic and consistent P(r) curves as occupying a convex polytope in a high-dimensional space. We use linear programming to sample this space and rapidly generate a diverse ensemble of P(r) curves. In this section we discuss practical issues in implementing this approach, limitations, and possible future directions.

Any approach with a few user-defined parameters faces the problem of appropriately selecting those parameters. If the smoothness and continuity parameters are too small, the linear program becomes infeasible; too large, and the curves become jagged. Practically, we found that a binary search for these parameters worked reasonably well. Another parameter is the number of basis functions used to represent the P(r) curves. We obtained the smoothest solutions when the number of basis functions was the minimum required to satisfy the linear program. The minimum number of basis functions also represents the smallest ensemble that satisfies the linear program.

One can imagine constraints describing
other desirable properties of P(r) curves beyond those we considered. So long as those constraints are linear, they may be easily added to the existing framework. Higher-order constraints might be added while still maintaining convexity and contiguity of the set of feasible P(r) curves.

There are limitations in the sampling technique we applied. Sampling uniformly in high dimensions is inherently hard, and we cannot claim to have a uniform coverage of the consistent P(r) space. This formulation does not indicate how big (or small) the polytope is, which depends on the user-defined parameters for various constraints as well as the particular properties of the transform matrix defined in equation 5. Perhaps most importantly, we need to be aware that changes in some directions in P(r) space may affect the corresponding I(q) curves more than changes in other directions.

We now discuss the results of our approach in the wider context of ab initio structure prediction using SAXS. The mapping from structure to P(r) to I(q) is not one-to-one, and every node in CATHRep represents a unique topology. Therefore, combining these two facts, our results in section 3 show that such alternate P(r) curves might correspond to proteins with significant structural differences. This observation should be used both as a caution and as an opportunity for structure elucidation from solution scattering data. As a caution, it reminds us that diverse alternate structures can give rise to similar scattering curves. As an opportunity, our results can be viewed as a first step in producing a complete ensemble of structures compatible with a given scattering curve.

5. ACKNOWLEDGEMENT

This work was supported in part by a grant from NSF SEIII (IIS-0502801) to CBK, AMF, and BAC.

References

1. M. E. Wall, S. C. Gallagher, J. Trewhella: Large-Scale Shape Changes in Proteins and Macromolecular Complexes. Annual Review of Physical Chemistry; 2000, Oct, 51: 355-380.
2. D. I. Svergun, H. B. Stuhrmann: New developments in direct shape determination from small-angle scattering 1. Theory and model calculations. Acta Crystallographica Section A; 1991, Nov, 47: 736-744.
3. D. Walther, F. E. Cohen, S. Doniach: Reconstruction of low-resolution three-dimensional density maps from one-dimensional small-angle X-ray solution scattering data for biomolecules. Journal of Applied Crystallography; 2000, Apr, 33: 350-363.
4. P. Chacón, F. Morán, J. F. Díaz, E. Pantos, J. M. Andreu: Low-Resolution Structures of Proteins in Solution Retrieved from X-Ray Scattering with a Genetic Algorithm. Biophysical Journal; 1998, 74, 6: 2760-2775.
5. M. V. Petoukhov, D. I. Svergun: Global Rigid Body Modeling of Macromolecular Complexes against Small-Angle Scattering Data. Biophysical Journal; 2005, May, 89, 2: 1237-1250.
6. T. R. Sosnick, J. Trewhella: Denatured states of ribonuclease A have compact dimensions and
residual secondary structure. Biochemistry; 1992, Sep, 31, 35: 8329-8335.
7. D. J. Segel, A. Bachmann, J. Hofrichter, K. O. Hodgson, S. Doniach, T. Kiefhaber: Characterization of transient intermediates in lysozyme folding with time-resolved small-angle X-ray scattering. Journal of Molecular Biology; 1999, May, 288, 3: 489-499.
8. G. Olah, R. D. Mitchell, T. R. Sosnick, D. A. Walsh, J. Trewhella: Solution structure of the cAMP-dependent protein kinase catalytic subunit and its contraction upon binding the protein kinase inhibitor peptide. Biochemistry; 1993, Apr, 32, 14: 3649-3657.
9. T. E. Williamson, B. A. Craig, E. Kondrashkina, C. Bailey-Kellogg, A. M. Friedman: Analysis of self-associating proteins by singular value decomposition of solution scattering data. Biophysical Journal; 2008, 94: 4906-4923.
10. D. I. Svergun, V. V. Volkov, M. B. Kozin, H. B. Stuhrmann: New Developments in Direct Shape Determination from Small-Angle Scattering 2. Uniqueness. Acta Crystallographica Section A; 1996, May, 52, 3: 419-426.
11. D. I. Svergun, M. V. Petoukhov, M. B. Kozin: Determination of Domain Structure of Proteins from X-Ray Solution Scattering. Biophysical Journal; 2001, Jun, 80, 6: 2946-2953.
12. M. V. Petoukhov, D. I. Svergun: New methods for domain structure determination of proteins from solution scattering data. Journal of Applied Crystallography; 2003, Jun, 36, 3 Part 1: 540-544.
13. A. Guinier, G. Fournet: Small-Angle Scattering of X-rays. Wiley, 1955.
14. D. I. Svergun: Determination of the regularization parameter in indirect-transform methods using perceptual criteria. Journal of Applied Crystallography; 1992, Aug, 25, 4: 495-503.
15. P. B. Moore: Small-angle scattering. Information content and error analysis. Journal of Applied Crystallography; 1980, Apr, 13, 2: 168-175.
16. O. Glatter: A new method for the evaluation of small-angle scattering data. Journal of Applied Crystallography; 1977, Oct, 10, 5: 415-421.
17. S. Steenstrup: Deconvolution in the presence of noise using maximum entropy principle. Australian Journal of Physics; 1985, 38: 319-327.
18. S. Steenstrup, S. Hansen: The maximum-entropy method without the positivity constraint - applications to the determination of the distance-distribution function in small-angle scattering. Journal of Applied Crystallography; 1994, Aug, 27, 4: 574-580.
19. D. I. Svergun, A. V. Semenyuk, L. A. Feigin: Small-angle-scattering-data treatment by the regularization method. Acta Crystallographica Section A; 1988, May, 44, 3: 244-250.
20. A. N. Tikhonov: On the stability of inverse problems. Doklady Akademii Nauk SSSR; 1943, 39, 5: 195-198.
21. A. N. Tikhonov, V. A. Arsenin: Solution of Ill-posed Problems. Wiley, 1977.
22. A. Guinier: La diffraction des rayons X aux très petits angles, application à l'étude de phénomènes ultramicroscopiques. Ann. Physique; 1939, 12: 161-237.
23. D. Avis, K. Fukuda: A Pivoting Algorithm for Convex Hulls and Vertex Enumeration of Arrangements and Polyhedra. Discrete & Computational Geometry; 1992, 8: 295-313.
24. D. Svergun, C. Barberato, M. H. J. Koch: CRYSOL - a Program to Evaluate X-ray Solution Scattering of Biological Macromolecules from Atomic Coordinates. Journal of Applied Crystallography; 1995, Dec, 28, 6: 768-773.
25. D. I. Svergun, M. H. Koch: Small-angle scattering studies of biological macromolecules in solution. Reports on Progress in Physics; 2003, 66, 10: 1735-1782.
26. C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, J. M. Thornton: CATH - A Hierarchic Classification of Protein Domain Structures. Structure; 1997, Aug, 5, 8: 1093-1108.
27. F. M. G. Pearl, C. F. Bennett, J. E. Bray, A. P. Harrison, N. Martin, A. Shepherd, I. Sillitoe, J. Thornton, C. A. Orengo: The CATH database: an extended protein family resource for structural and functional genomics. Nucl. Acids Res.; 2003, 31, 1: 452-455.
28. A. V. Sokolova, V. V. Volkov, D. I. Svergun: Prototype of a database for rapid protein classification based on solution scattering data. Journal of Applied Crystallography; 2003, Jun, 36, 3 Part 1: 865-868.
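
As a supplementary illustration of the comparison summarized in Table 2, the following minimal Python sketch measures the distance between two P(r) curves as the area between them, then counts how many library curves lie closer to a reference curve than the maximum and median distances found in an ensemble. It is not the implementation used in this work. The function names (`area_between`, `overlap_counts`), the assumption that all curves are sampled on one shared r grid, and the use of the trapezoid rule on the pointwise absolute difference are hypothetical choices made only for illustration.

```python
from statistics import median


def area_between(r, p1, p2):
    """Approximate the area between two curves sampled on the same grid r.

    Assumption: trapezoid rule applied to |p1 - p2|; this is only an
    approximation in intervals where the two curves cross.
    """
    if not (len(r) == len(p1) == len(p2)):
        raise ValueError("r, p1 and p2 must have the same length")
    diffs = [abs(a - b) for a, b in zip(p1, p2)]
    return sum(
        0.5 * (diffs[i] + diffs[i + 1]) * (r[i + 1] - r[i])
        for i in range(len(r) - 1)
    )


def overlap_counts(r, reference, ensemble, library):
    """Count library curves closer to the reference than the ensemble thresholds.

    Returns (count below the ensemble's maximum distance,
             count below the ensemble's median distance).
    """
    ensemble_d = [area_between(r, reference, p) for p in ensemble]
    library_d = [area_between(r, reference, p) for p in library]
    d_max = max(ensemble_d)
    d_median = median(ensemble_d)
    below_max = sum(d < d_max for d in library_d)
    below_median = sum(d < d_median for d in library_d)
    return below_max, below_median
```

In this sketch, `ensemble` would hold the generated alternate P(r) curves for one reference protein and `library` the P(r) curves of a representative set such as CATHRep; the strict comparison mirrors the "below these thresholds" wording of Section 3.3.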