PyMod Documentation (Version 2.1, September 2011)


PyMod User's Guide

PyMod Documentation (Version 2.1, September 2011)
Emanuele Bramucci & Alessandro Paiardini, Francesco Bossa, Stefano Pascarella
Department of Biochemical Sciences A. Rossi Fanelli, Sapienza University of Rome, Italy

Table of Contents

1 Introduction
2 Installation
  2.1 Windows (XP/Vista/Seven)
  2.2 Mac OS (10.5+)
  2.3 Linux (Ubuntu 10+)
3 PyMod Overview
  3.1 Components
    Similarity search
    Alignment of sequences and structures
    Homology Modeling
4 Usage Example
  4.1 Modeling the dihydrofolate reductase from Mycobacterium avium
5 References

1 Introduction

PyMod is a simple and intuitive interface between the popular molecular graphics system PyMOL [1] and several other tools: (PSI-)BLAST [2], MUSCLE [3], ClustalW [4], CEalign [5] and MODELLER [6]. It was developed to show how integrating the individual steps required for homology modeling and sequence/structure analysis within the PyMOL framework can greatly simplify these tasks. PyMod implements sequence similarity searches, the generation and editing of multiple sequence and structural alignments, and the merging of sequence and structure alignments, with the aim of providing a simple yet powerful tool for sequence and structure analysis and for building homology models.

2 Installation

2.1 Windows (XP/Vista/Seven)

1. Check which Python version your PyMOL uses. Type import sys; print sys.version in the PyMOL console and note the leading version number (e.g. "2.7").
2. Retrieve the Windows installer for your Python version (from step 1) from the Download page (schubert.bio.uniroma1.it/pymod/download.html).
3. The installer will guide you through the installation process. Remember to register MODELLER to get a license key.
4. When the installation has finished, PyMod will be available from the Plugin menu of PyMOL.

2.2 Mac OS (10.5+) (Beta test - some functions may be missing)

If you have Mac OS X 10.5 you need to use PyMOL

1. Check which Python version your PyMOL uses. Type import sys; print sys.version in the PyMOL console and note the leading version number (e.g. "2.7").
2. Retrieve the Mac package for your Python version (from step 1) from the Download page.
3. Unzip the package and copy the contents of its "modules" and "startup" directories into your PyMOL "modules" and "startup" folders respectively. These can usually be found at: PyMOLX11Hybrid.app/pymol/modules/pmg_tk/startup

4. Download and install ClustalW (ftp://ftp.ebi.ac.uk/pub/software/clustalw2/).
5. Download and install MODELLER. Remember to register to get a license key. If you have installed PyMOL 1.4 you need MODELLER version 9.9 or greater.
6. (Not required if you have PyMOL 1.4 and Python version 2.7 [from step 1].) The final step is the setup of the CEAlign module. You can either compile ccealign from the source (a) or try the "quick and dirty" method (b):
   a. Go to your ".../pmg_tk/startup/pymod/cealign" directory, open a shell and type "sudo python setup.py build". The compiler generates a folder named "build". Inside this folder there is a directory whose name is based on your OS and Python version (e.g. "lib.linux-x86_64-2.6"). Copy the file "ccealign.so" from this directory and paste it into ".../startup/pymod".
   b. Go to your ".../pmg_tk/startup/pymod/cealign" directory, rename the file "ccealign-version-10.X.so" (10.X is the version of your OS) to "ccealign.so" and copy it into ".../startup/pymod".

2.3 Linux (Ubuntu 10+) (Beta test - some functions may be missing)

1. Retrieve the Linux package from the Download page and unzip all the files into the "startup" folder of PyMOL. It might be under: /var/lib/python-support/python2.x/pmg_tk/startup/
2. Open the Synaptic package manager (System--->Administration--->Synaptic package manager) and install these packages:
   a. Clustalw
   b. Biopython
   c. Python-dev (this is needed for the last step)
3. Download and install MODELLER. Remember to register to get a license key. If you have installed PyMOL 1.4 you need MODELLER version 9.9 or greater.
4. The final step is the setup of the CEAlign module. You have to compile ccealign from the source (this is why Python-dev was installed in step 2):
   a. Go to your ".../pmg_tk/startup/pymod/cealign" directory.
   b. Open a shell and type "sudo python setup.py build".
   c. The compiler generates a folder named "build". Inside this folder there is a directory whose name is based on your OS and Python version (e.g. "lib.linux-x86_64-2.6").
   d. Copy the file "ccealign.so" from this directory and paste it into ".../startup/pymod".
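The last two sub-steps (locating "ccealign.so" inside "build" and copying it next to the plugin) can also be scripted. The following is only a sketch: the function names are hypothetical, and both directory arguments must be supplied by the reader because the guide elides the full paths. Writing into a system-wide PyMOL folder may require administrator rights.

```python
import os
import shutil


def find_built_module(build_dir, filename="ccealign.so"):
    """Return the path of the first file called `filename` under build_dir, or None."""
    for root, _dirs, files in os.walk(build_dir):
        if filename in files:
            return os.path.join(root, filename)
    return None


def install_built_module(build_dir, dest_dir, filename="ccealign.so"):
    """Copy the compiled module from the build tree into dest_dir and return the new path."""
    src = find_built_module(build_dir, filename)
    if src is None:
        raise IOError("%s not found under %s" % (filename, build_dir))
    dest = os.path.join(dest_dir, filename)
    shutil.copy(src, dest)
    return dest
```

Searching the whole "build" tree avoids hard-coding the platform-specific directory name (such as "lib.linux-x86_64-2.6"), which differs between systems and Python versions.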

3 PyMod Overview

[Figure 1: flowchart with the boxes Sequence; Database search: (PSI-)BLAST; Sequence alignment: MUSCLE - ClustalW; Sequences; Structures; Structural alignment: CE align; Structure-based multiple sequence alignment; Homology Modeling: MODELLER; 3D-Structure]

Figure 1. Flowchart representing the PyMod workflow. Every step can be used on its own, e.g. you don't need to run BLAST (to retrieve sequences) before aligning two or more sequences with ClustalW or MUSCLE. The algorithms used are highlighted in red.

3.1 Components

PyMod offers rich functionality built around its core window for sequence alignment, clustering and editing. Its components are described in the following sub-sections.

Similarity search

BLAST - The BLAST algorithm is a heuristic program, which means that it relies on some smart shortcuts to perform the search faster. BLAST performs "local" alignments. Most proteins are modular in nature, with functional domains often being repeated within the same protein as well as across different proteins from different species. The BLAST algorithm is tuned to find these domains or shorter stretches of sequence similarity (McEntyre J, Ostell J: The NCBI Handbook).

PSI-BLAST - ( &RUN_PSIBLAST=on ) Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins or new members of a protein family. Use PSI-BLAST when a standard protein-protein BLAST search has failed to find significant hits. The first round of PSI-BLAST is a standard protein-protein BLAST search. The program builds a position-specific scoring matrix (PSSM or profile) from a multiple alignment of the sequences returned with Expect values better (lower) than the inclusion threshold (default=0.005). The PSSM is used to evaluate the alignment in the next iteration of the search. Any new database hits below the inclusion threshold are included in the construction of the new PSSM. A PSI-BLAST search is said to have converged when no more matches to new database sequences are found in subsequent iterations ( ogselectionguide#tab31).

Alignment of sequences and structures

MUSCLE - MUSCLE is a program for creating multiple alignments of amino acid or nucleotide sequences. A range of options is provided that give you the choice of optimizing accuracy, speed, or some compromise between the two.

ClustalW - ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins. It attempts to calculate the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.
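The PSI-BLAST iteration described above (include hits below the inclusion threshold, rebuild the profile, stop when no new sequences appear) can be illustrated with a small control loop. This is a sketch of the convergence logic only, not of BLAST itself: search_round is a hypothetical callback standing in for one search round, and the only value taken from the text is the default threshold of 0.005.

```python
INCLUSION_THRESHOLD = 0.005  # default inclusion threshold quoted in the text


def iterate_until_converged(search_round, max_iterations=10,
                            threshold=INCLUSION_THRESHOLD):
    """Repeat search rounds until no new sequence passes the threshold.

    search_round(included) is a hypothetical stand-in for one PSI-BLAST round.
    It receives the set of sequence ids already used to build the profile and
    returns a dict mapping sequence id to Expect value.
    Returns (included ids, number of rounds run, converged flag).
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = search_round(included)
        passing = set(seq_id for seq_id, e_value in hits.items()
                      if e_value < threshold)
        new_ids = passing - included
        if not new_ids:
            # No new database sequences were found: the search has converged.
            return included, iteration, True
        included |= new_ids
    return included, max_iterations, False
```

The max_iterations cap is an added safeguard so that a search which keeps finding new hits still terminates.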

Cealign - ( http://cl.sdsc.edu/ce.html ) CE is a method for calculating pairwise structure alignments.

CE aligns two polypeptide chains using characteristics of their local geometry as defined by vectors between C alpha positions. Matches are termed aligned fragment pairs (AFPs). Heuristics are used in defining a set of optimal paths joining AFPs with gaps as needed. The path with the best RMSD is subject to dynamic programming to achieve an optimal alignment.

Homology Modeling

Modeller - MODELLER is used for homology or comparative modeling of protein three-dimensional structures. The user provides an alignment of a sequence to be modeled with known related structures and MODELLER automatically calculates a model containing all non-hydrogen atoms. MODELLER implements comparative protein structure modeling by satisfaction of spatial restraints, and can perform many additional tasks, including de novo modeling of loops in protein structures, optimization of various models of protein structure with respect to a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching of sequence databases, comparison of protein structures, etc.

4 Usage Example

4.1 Modeling the dihydrofolate reductase from Mycobacterium avium

Go to the NCBI web site and search for the dihydrofolate reductase from Mycobacterium avium (GI: ). Download the sequence file in FASTA format. Launch PyMOL and select PyMod from the PyMOL Plugin menu. From the main window of PyMod select File ---> Sequences ---> Add from file and choose the FASTA file that you have just downloaded. The sequence will be imported into the plugin, as shown in fig. 2.

Figure 2. PyMod main window.

The next step is a database search for homologous sequences that have an experimentally solved 3D structure. To perform this task we will use the BLAST function:


Select the sequence by left-clicking on its header (in the PyMod left panel; it will turn green). From the Tools menu select BLAST; a preferences window will appear (fig. 3). Several parameters can be modified; in this tutorial, however, we simply keep the default values and submit. This operation could take several minutes, depending on the sequence length and the speed of your internet connection.

Figure 3. BLAST Preferences window.

When the database search has finished, the results window will show up; here you can choose to import one or more sequences (fig. 4). As you can see in this example, the first entry has 100% identity with our query sequence; this is because the structure of the dihydrofolate reductase of Mycobacterium avium has already been solved experimentally. We will ignore this entry for now and use it later to validate our results. For this tutorial we will choose two proteins as templates for the modeling task: the dihydrofolate reductase from Bacillus anthracis (PDB code: 3JW3; 33.94% sequence identity with our query) and the dihydrofolate reductase from Moritella profunda (PDB code: 2ZZA; 40.80% sequence identity with our query). Select these proteins using the checkboxes and press Submit.

Figure 4. BLAST output window.
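The selection rule used here (skip the 100% identical entry, keep the other hits as templates) can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, the identifier "self_hit" is a placeholder for the 100% entry, and percent identity is computed over all alignment columns, which is one of several conventions in use.

```python
def percent_identity(aligned_query, aligned_hit):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_hit) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = sum(1 for q, h in zip(aligned_query, aligned_hit)
                    if q == h and q != "-")
    return 100.0 * identical / len(aligned_query)


def select_templates(hits, max_identity=100.0):
    """Keep hits whose identity is strictly below max_identity, best first.

    hits is a list of (identifier, percent identity) pairs.
    """
    kept = [hit for hit in hits if hit[1] < max_identity]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Identity values quoted in this tutorial; "self_hit" is a placeholder name.
hits = [("self_hit", 100.0), ("3JW3", 33.94), ("2ZZA", 40.80)]
templates = select_templates(hits)
```

With these values, templates contains 2ZZA followed by 3JW3, and the 100% entry is left out for later validation.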

The selected sequences will be imported into the PyMod main window and clustered with your query sequence. You can expand or collapse this cluster by clicking on the + button placed beside your query sequence. Expand the cluster and download the corresponding PDB structures by right-clicking on each sequence header and selecting Get PDB File (fig. 5). After a few seconds PyMod will automatically import the structures into PyMOL and split them by chain in the PyMod main window (fig. 6).

Figure 5. Get PDB File function.

Figure 6. Structures imported in PyMOL and split by chain.
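To make the "split by chain" step concrete, the sketch below groups the coordinate records of a PDB file by chain identifier. It is not PyMod's own code, and the function name is hypothetical; the only fact it relies on is the fixed-column PDB format, in which the chain identifier of ATOM and HETATM records is stored in column 22.

```python
def split_by_chain(pdb_text):
    """Group ATOM/HETATM records by chain identifier (column 22 of each record)."""
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chain_id = line[21]  # column 22, zero-based index 21
            chains.setdefault(chain_id, []).append(line)
    return chains
```

Keeping only one chain, as done in the next step of the tutorial, then amounts to reading a single entry of the returned dictionary, for example split_by_chain(text)["A"].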


You can select all the sequences that you don't want to work with (by left-clicking them) and then delete the selection through the pop-up menu on the left panel of the PyMod window (this option is visible in fig. 5; in that case it was not clickable because only one sequence was selected). Here we will keep the A chains and delete the other ones.

Although the gain in accuracy from using multiple structural templates is still a matter of debate, it has been claimed over the years that this approach is better able to capture the variability and divergence of natural structures [7]. When modeling with multiple templates, it is mandatory to superpose them as a first step and then derive a structure-based sequence alignment. To accomplish this task, select the headers of the proteins 3JW3 and 3IA4 and click on Tools ---> CE struct alignment (fig. 7).

Figure 7. CE align function.

A dialog box will appear, asking whether you want to use sequence information in the Combinatorial Extension algorithm. Using sequence information increases the probability that similar amino acids will be structurally superposed. Press YES in the dialog box. After a few seconds the structures will be superposed in PyMOL and the derived structure-based sequence alignment will be shown in PyMod (fig. 8).

Figure 8. Structural alignment performed with the Combinatorial Extension algorithm.
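The quality of a superposition like this one is usually summarized by the RMSD over the matched C alpha atoms, the same quantity CE uses to rank its candidate paths. The sketch below computes it for two coordinate lists that are assumed to be already superposed and paired residue by residue; it does not perform the superposition itself, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Written as a formula, this is RMSD = sqrt((1/N) * sum of d_i^2), where d_i is the distance between the i-th pair of matched atoms and N is the number of pairs.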


After the structural templates have been aligned, add the query sequence to the alignment. To accomplish this task you can choose between two different tools: ClustalW and MUSCLE. In this case we will use the first one. Select all the sequences by left-clicking on their headers and click Tools ---> ClustalW. As usual, the preferences window will appear, allowing you to modify some of the most important parameters of the algorithm. We can simply keep the default values and submit. A dialog box will appear asking whether we want to keep the previously obtained structural alignment. Since we would like to keep the structural alignment in-frame (i.e., adding indels, when necessary, to both templates), click Yes. At this point the structural and sequence alignments will be merged together.

As a refinement step we want to delete the C-terminal overhang: right-click the query sequence and select Edit Sequence. In the Edit sequence window just delete the last amino acids as shown in fig. 9 and press Submit. Edit the other sequences in the same way to delete their overhangs.

Figure 9. Sequence editor window.

Once a multiple alignment has been obtained, we can proceed with the last step of the flowchart, model building. Before performing this last task, however, we will manually check the alignment to pinpoint potentially misaligned regions. Indeed, scrolling the alignment towards the C-terminal region (approximately near ASP 130 of the query sequence), we notice four consecutive ASP residues that are not present in the structural templates. This suggests a possible indel in this region. Modify the alignment as shown in fig. 10, by left-clicking on a sequence and dragging to the right to create an insertion or to the left to remove one.

Figure 10. Refining the alignment.

The next step is the homology model building. Select the query sequence and click Tools ---> Modeller. In the options window (fig. 11) choose both templates and set the optimization level to High. Make sure to include heteroatoms (i.e., ligands or cofactors) during the model building. Click SUBMIT. This operation could take several minutes.

Figure 11. Modeller option window.

When Modeller has finished, the homology model will be automatically imported into the PyMOL main window (fig. 12) and a DOPE score-based graph will appear for an energetic validation of the model (fig. 13).

Figure 12. Homology model imported in the PyMOL main window.
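For readers who want to reproduce the model-building step outside the plugin, MODELLER reads the target-template alignment from a file in its PIR format. The sketch below writes such entries; it is an illustration under stated assumptions, not a description of how PyMod prepares its input, which this guide does not cover. The function names and the example sequences are placeholders, and the use of the FIRST and LAST keywords for the residue range should be checked against the MODELLER manual for your version.

```python
def pir_entry(code, aligned_sequence, is_template, chain="A"):
    """Format one alignment entry in MODELLER's PIR dialect.

    The second line holds ten colon-separated fields; only the entry type,
    the code and (for templates) the residue range and chain are filled in.
    """
    if is_template:
        fields = ["structureX", code, "FIRST", chain, "LAST", chain, "", "", "", ""]
    else:
        fields = ["sequence", code, "", "", "", "", "", "", "", ""]
    return ">P1;%s\n%s\n%s*\n" % (code, ":".join(fields), aligned_sequence)


def pir_alignment(entries):
    """Join (code, aligned_sequence, is_template) tuples into one PIR alignment text."""
    return "\n".join(pir_entry(code, seq, is_template)
                     for code, seq, is_template in entries)


# Placeholder sequences; real ones come from the alignment built in PyMod.
example = pir_alignment([
    ("3jw3", "MKV-SLI", True),
    ("query", "MRVDSLI", False),
])
```

The trailing asterisk marks the end of each sequence, and gap characters in the aligned strings are written unchanged.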


Figure 13. DOPE score-based graph.

Now we can compare the obtained model with the experimentally solved 3D structure. Click on Plugin ---> PDB Loader Service in the PyMOL menu and type 2W3W. Then click on the A next to the 2w3w code and choose Align to molecule 1_gi_. The structures will be superposed as shown in fig. 14.

Figure 14. Superposition of the obtained model with the experimentally solved 3D structure. In white: model of the dihydrofolate reductase of Mycobacterium avium. In cyan: experimentally solved dihydrofolate reductase of Mycobacterium avium (PDB code: 2W3W).

As we can see, our model contains only a few errors in the external loops and is highly consistent with the experimentally solved structure in the core region and in the active site. It is also important to stress that building the model with heteroatoms included allows the side chains in the active site to be oriented correctly, as shown in fig. 15.

Figure 15. Correct orientation of the side chains that interact with the cofactor in the active site of the protein. In white: model of the dihydrofolate reductase of Mycobacterium avium. In cyan: experimentally solved dihydrofolate reductase of Mycobacterium avium (PDB code: 2W3W).

5 References

1. DeLano WL: The PyMOL Molecular Graphics System. San Carlos, CA: DeLano Scientific.
2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol.
3. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32(5).
4. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22.
5. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 9.
6. Eswar N, Marti-Renom MA, Webb B, Madhusudhan MS, Eramian D, Shen M, Pieper U, Sali A: Comparative Protein Structure Modeling With MODELLER. Current Protocols in Bioinformatics 2006, Supplement 15.
7. Venclovas Č, Zemla A, Fidelis K, Moult J: Assessment of progress over the CASP experiments. Proteins 2003, 53, Suppl 6.
