GPRO 1.0 THE PROFESSIONAL TOOL FOR SEQUENCE ANALYSIS/ANNOTATION AND MANAGEMENT OF OMIC DATABASES. (February 2011)


The user guide you are about to read may not be fully up to date with the latest downloadable version of the software. GPRO is under continuous development as an ongoing effort to improve it and satisfy users' requirements. In this version of the user guide you will find explanations of the general GPRO layout, interface and menu implementations. Many GPRO functions are described in detail, but recently included features may not yet be documented. We are working to keep the user guide as current as possible; if you find insufficient or missing information on the GPRO topic you are interested in, do not hesitate to contact our technical support team (gydb@gydb.org).

Ricardo Futami (1), Alfonso Muñoz-Pomer (1,2), Jose Maria Viu (1), Laura Dominguez-Escribá (1), Laura Covelli (1), Guillermo Pablo Bernet (1), José María Sempere (2), Andrés Moya (3,4), Carlos Llorens (1)

(1) Biotechvana, Parc Cientific de la Universitat de València
(2) Departamento de Sistemas Informáticos y Computación (DSIC), Universitat Politècnica de València
(3) Unidad Mixta de Investigación en Genómica y Salud del Centro Superior de Investigación en Salud Pública (CSISP)-Universitat de València (Instituto Cavanilles de Biodiversidad y Biología Evolutiva)
(4) CIBER en Epidemiologia y Salud Pública (CIBEResp)

Authors:
RF: ricardo.futami@biotechvana.com
AMP: alfonso.munozpomer@biotechvana.com
JMV: josemaria.viu@biotechvana.com
LD: laura.dominguez@biotechvana.com
LC: laura.covelli@biotechvana.com
GPB: guillermo.bernet@biotechvana.com
JMS: jsempere@dsic.upv.es
AM: andres.moya@uv.es
CLL: carlos.llorens@biotechvana.com

Copyright (2010) Biotechvana, Parc Cientific, Universitat de València, Paterna (Valencia), Spain

TABLE OF CONTENTS

1. INTRODUCTION AND OVERVIEW ACQUIRING AND INSTALLING GPRO Copyright and License System requirements Installation Warranty and Technical Support LAYOUT AND PIPELINE FEATURES Pipeline and remote computing cluster Layout THE MENU: COMMANDS AND FUNCTIONS Databases SELECT DIRECTORY SINGLE FASTA FILE GENBANK ACCESSION OPEN WORKSHEET NEW WORKSHEET Directory Editing TIME SEQUENCE EDITOR Layout TIME menu File Edit Translate Orientation and Geometry Find ORFs Find motifs DATABASE EDITOR File Search and replace Export and Remove Find Motifs Omic analyses FORMAT DATABASES BLAST ANALYSES HMM ANALYSES PROCESSING/RETRIEVING RESULTS FROM BLAST AND HMM OUTPUTS Alignment Analyses HMMs AND CONSENSUS SEQUENCES SEQUENCE LOGOS 4.6. Management FILES AND FOLDERS Join Folders Join Files Split Files FIND SEQUENCES ALIGNMENTS Join alignments Format alignments PIPELINE CONNECTION SETTINGS WORKSHEET PREFERENCES Help WORKSHEET ANNOTATION SYSTEM File OPEN WORKSHEET SAVE SAVE AS SET AS DEFAULT WORKSHEET EXPORT Export CSV & FASTA Exporting annotations Exporting rows in categories and clusters Search and replace Sorting/Filtering Import APPEND WORKSHEET COMBINE WORKSHEETS CLUSTERS Export options GENE ONTOLOGY (GO) COG/KOGS APPLY ANNOTATION COLORS SWITCH GI/ACCESSION Selecting rows BY KEY TERMS: BY EXPECT OR STATISTICS VALUES: BY COLOR REMOVE SELECTED SEQUENCES Associate database ASSOCIATE DATABASE REMOVE ASSOCIATION MOUSE FUNCTIONS AND TRICKS Directory and FTP Database Editor Worksheet 7. ACKNOWLEDGEMENTS CITING GPRO REFERENCE LIST

1. INTRODUCTION AND OVERVIEW

The analysis of omic data (including genomics, proteomics and metabolomics) is closely connected to the availability of comparative information stored in various online initiatives [1-7]. A variety of software applications and web servers (for example [8-12]) are available to massively analyze and annotate biological information derived from omic projects by assigning function, taxonomy and biological categories to newly characterized sequences. Furthermore, in the contemporary post-genomic biomedical era, advances in next generation sequencing (NGS) technologies have allowed researchers to obtain and analyze thousands or millions of sequencing reads simultaneously. However, managing and editing the many distinct database files generated by these projects is a daunting task. It requires automatic pipelines and tool suites able to generate protocols that massively and simultaneously analyze genes/proteins and other genomic features such as repeat variations (see [13-18]), mobile genetic elements (MGEs) [19-23], domains and modules [24-26], exon-intron segmentation [27], ontology [28-30] and complexity [31,32]. Bioinformaticians spend unnecessary time concatenating distinct scripts, usually designed ad hoc, to automate the management and labeling of data files and the sequence information they contain.

With these aspects in mind, this project concerns the development of Gypsy Database PROfessional (GPRO), a software package for the annotation, data processing, management and functional analysis of DNA/RNA and protein databases. GPRO consists of installable multifunction software coupled online with an omic pipeline installed on a high-end computing server hosted at the Gypsy Database (GyDB) of Mobile Genetic Elements [1], enabling users to run intensive computational jobs in remote private sessions. The software-pipeline combination implements a powerful annotation management system that allows users to map, annotate and analyze multiple distinct sequences simultaneously using the most common tools [33,34], vocabularies of gene ontology (GO) and ortholog (COG/KOG) classification [28,35,36], and mobile genetic element (MGE) databases [37,38]. In addition, GPRO implements a suite of tools that brings simplicity and versatility to the management of files and folders.

As the amount of genome sequencing data increases (in part because of the advent of next generation DNA sequencing methods [39-44]) there is a growing need for computer software that can manage large volumes of biological data. GPRO is concerned with the molecular analysis of DNA/RNA and protein sequences, allowing users to edit, translate and analyze sequences of up to two gigabases, and to retrieve ORFs and sequence motifs from edited sequences. The current version offers a wide array of analytical tools for annotation manipulation, data mining and sequence tagging. GPRO is accompanied by a user guide to aid understanding, but a general overview of the tool's capabilities and organization can be gained by reading the paper in which the software is introduced [45].

GPRO is an intuitive tool designed to be operated by any user, with or without detailed knowledge of bioinformatics or computational biology. However, the manual assumes familiarity with the most basic concepts of bioinformatics and computational biology (users new to the subject are directed to the following bibliography [46,47]), and the tool implements a large number of utilities. It is recommended that users read the manual or the aforementioned paper before beginning to work with GPRO.

GPRO is a work in progress; a new release (version 2.0) is in preparation. It will add new commands for evaluating the read quality files provided by the distinct available technologies (Illumina, Roche 454, Helicos, Sanger, Solid) and other functionalities including: (a) statistics and graphic representations; (b) a genome browser to superimpose predictive models over sequences; (c) learning algorithms and kernel grammars to concatenate Open Reading Frames (ORFs) repeats; (d) MGE and virus features (cassettes and domains) to automate annotation of intron-exon organized genes, multi-domain proteins, full-length MGEs and endogenous viruses. User feedback concerning this new implementation would be helpful, as would any suggestions concerning new commands and utilities that might be of general interest when working with omic data. Comments can be posted using the "suggestions and bugs" form available in the Help tab of the main GPRO menu.

2. ACQUIRING AND INSTALLING GPRO

2.1. Copyright and License

GPRO is the intellectual property of Biotech Vana SL (Biotechvana).
The software is protected under copyright and intellectual property laws (including international copyright treaties and other intellectual property treaties). Licensing of GPRO is subject to a commercial and private source agreement that should be accepted during installation of the tool on your PC. This agreement allows unrestricted use of the tool for academic and industrial research and services (online or to third parties) but does not permit the sale, rent, re-distribution, reverse engineering, decompiling, disassembly, or other translation or analysis of any source code of the software, underlying ideas, algorithms or programming by any means without explicit authorization from Biotechvana.

2.2. System requirements

GPRO is a Java application that runs on personal computers (PCs) and workstations as standalone software. The program is distributed as an installer for Windows XP/Vista/7 (32 bit and 64 bit), a self-extracting disk image for Mac OS X 10.5 or later (64 bit), and a compressed tarball archive for Linux 2.6 kernel series or later (32 bit and 64 bit).

Microsoft Windows: Windows XP/Vista/7; Intel Pentium GHz or Athlon XP processor or higher; 1 GB RAM
Linux: Linux distributions; Intel Pentium GHz or Athlon XP processor or higher; 1 GB RAM
Apple Mac OS X: Mac OS X 10.5 or later; Intel Core Duo processor or higher; 1 GB RAM

All systems require Java 1.5 or later. The latest version can be downloaded from the following URL:

Please note that for projects containing a large number of sequences, a minimum of 4 GB of RAM is recommended for optimum performance.

2.3. Installation

Microsoft Windows: GPRO 1.0 for Microsoft Windows systems is distributed with an executable installer. To install it, double-click on the installer and follow the instructions on screen. This process automatically generates desktop and start menu shortcuts. To uninstall GPRO, it is recommended to use the Add/Remove Program option in the Windows Control Panel. A dialog box will display a list of programs. Choose GPRO and then click Add/Remove.

Apple Mac OS X: The Mac OS X version is installable via an Apple disk image (DMG) file. Double-click the file to mount it and then drag GPRO to your Applications folder. To uninstall, as with any other Mac OS X application, drag it to the Trash.

Linux distributions: Installation on Linux systems is carried out by uncompressing a gzipped tarball (.tar.gz archive) in your home directory (or any directory for which you have write access). This will create a directory named GPRO. Inside you will find, among others, an executable file that you can run to open GPRO. To uninstall GPRO, delete the directory created by the installation process.

2.4. Warranty and Technical Support

GPRO has been satisfactorily tested on the various computing systems for which we support installation. Should your license prove defective, warranty assistance is provided via the technical support service. You can access this service via the "suggestions and bugs" utility implemented within the "Help" tab of the main GPRO menu or by accessing this URL ( ). Warranty is free of charge for two years after acquiring the license. However, Biotechvana does not provide any warranty or assume any liability or responsibility for the use of and results obtained using this tool, expressed or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. In no event shall Biotechvana be liable for any damages including direct, indirect, incidental, consequential and loss of business profits including general, special, incidental or consequential damages arising out of the use or inability to use the program including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the program to operate with any other programs, even if Biotechvana has been advised of the possibility of such damages.

3. LAYOUT AND PIPELINE FEATURES

GPRO can be thought of as a hybrid between academic initiatives such as BLAST2GO [11] and professional tools such as Geneious [48]. It consists of installable multifunctional software coupled online with an omic pipeline installed on a high-end computing server hosted at GyDB [1], enabling users to run intensive computational jobs in remote private sessions. GPRO is divided into four major components: PIPELINE, LAYOUT, MENU and WORKSHEET.
This section provides a brief description of the first two components (pipeline and layout) and their characteristics.

3.1. Pipeline and remote computing cluster

GPRO is coupled with an omic pipeline installed on a high-end remote cluster for running intensive computational jobs. The pipeline includes a user account on the remote computing cluster for running intensive computing analyses in a private session, based on the following services or utilities:

- 50 GB of hard disk space, which will increase periodically to guarantee sufficient space for the most demanding projects.
- A guaranteed quality of service: distributed CPU bandwidth for high-throughput computing analyses, providing users with the maximum available processing capacity on the cluster.
- An SSH client for logging into a user's private account on the remote cluster and sending commands for launching automated analysis tools.
- An FTP client system organized as a remote Filetree manager for transferring analysis files between a client computer and the user's account on the remote cluster. Users can upload sequence files for processing on the remote cluster and download generated result files to a local computer.
- A database compiler tool for BLAST [33] and HMMER [34] servers.
- A graphical front-end for launching BLAST and HMMER automated batch analyses using pre-compiled or user-generated databases. These tools can be launched in unattended mode, notifying the user when the job is complete.
- A script for processing BLAST and HMMER result files into XML format for automated generation of two CSV (comma-separated values) reports and their associated FASTA sequences according to an E-value cut-off defined by the user.
- A script to append GO or COG/KOG annotation terms automatically to the mapped sequences listed in the annotation management worksheet system.

Figure 1 presents a flowchart of the flow of actions between GPRO and the pipeline. By default, the pipeline is hosted at GyDB. However, if you are interested in installing the pipeline suite on your own server or computational cluster, the package of scripts that launch the pipeline can be provided, accompanied by directives for installing it and compiling the tools invoked by the pipeline. For more detailed information concerning the menus and commands of the GPRO pipeline, see Section 4.4, "Omic Analyses".
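The E-value cut-off step described in the list of pipeline services can be illustrated with a minimal Python sketch. It is not the GPRO pipeline script: the function name `filter_hits`, the in-memory `(query, subject, evalue)` tuples and the single CSV layout are assumptions made for the example, whereas the real pipeline works from BLAST/HMMER result files and produces two reports.

```python
import csv
import io


def filter_hits(hits, sequences, evalue_cutoff):
    """Keep hits at or below an E-value cut-off (hypothetical helper).

    hits: iterable of (query_id, subject_id, evalue) tuples.
    sequences: dict mapping query_id to its sequence string.
    Returns (csv_text, fasta_text) for the retained hits.
    """
    kept = [hit for hit in hits if hit[2] <= evalue_cutoff]

    # CSV report: one row per retained hit.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["query", "subject", "evalue"])
    for query_id, subject_id, evalue in kept:
        writer.writerow([query_id, subject_id, evalue])

    # FASTA output: each query with at least one retained hit, written once.
    seen = []
    for query_id, _, _ in kept:
        if query_id not in seen:
            seen.append(query_id)
    fasta_text = "".join(f">{q}\n{sequences[q]}\n" for q in seen)
    return buffer.getvalue(), fasta_text
```

A lower cut-off keeps fewer, more significant hits; queries whose hits all exceed the cut-off are left out of both outputs.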

Figure 1. Pipeline-software flowchart

3.2. Layout

GPRO implements a user-friendly software interface displaying a central menu and an intuitive layout organized in four windows (Figure 2) for handling and managing analyses and omic projects.

MAIN DESKTOP: central working space for editing files and databases, managing omic projects and analyzing sequences.

DIRECTORY: users can define a major directory for storing databases and omic projects. This directory can be shown (and hidden) to the left of the main desktop. It organizes files and folders as an interactive Filetree hierarchy that lets users visualize, select and drag any item from the directory to other sections of the tool using the mouse.

FTP: a File Transfer Protocol (FTP) window that allows users to upload and download files and folders (using the mouse) between the directory and the remote user account on which the GPRO pipeline runs.


FASTA EXPLORER: a window-based utility coupled with the "Database Editor" that gives users visual control over, and lets them manage, the names of the sequences in a text-edited database.

Figure 2. GPRO layout organization and interface implementation. Numbers indicate the four window-based sections as described in the text and visualized in the figure: (1) main desktop; (2) Directory; (3) FTP; (4) FASTA explorer.

4. THE MENU: COMMANDS AND FUNCTIONS

As shown in Figure 2, the MENU is located at the top of the software and integrates the following commands:

Figure 3. Menu tool bar and functions. Numbers indicate the eight tabs associated with each function as described in the text.

1. DATABASES: this tab allows users to choose a specific custom-directory folder, or to open sequence files and databases (only in FASTA format), GenBank accessions and worksheets for managing databases.

2. DIRECTORY: the software implements a desktop called directory to manage the distinct folders and files of an omic project (see the Layout section). Using this tab, users can show or hide the directory at the left of the tool interface.

3. EDITOR: this command launches two editing programs. The first is a Database editor associated with distinct utilities for the editing, mining and management of sequence database files. The second is an implementation of TIME [49], a sequence editor to display, analyze and edit protein and nucleotide sequences of up to 2 x 10^9 residues (two gigabases or amino acids).

4. OMIC ANALYSES: this tab allows users to exploit the pipeline for transferring sequence and HMM (Hidden Markov Model) database files from their computers to their accounts on the remote computing cluster in order to format BLAST databases, perform BLAST/HMM searches against these or other databases and process the outputs into FASTA and annotation files.

5. ALIGNMENT ANALYSIS: utilities for creating HMM profiles [50], majority rule consensus (MRC) sequences and sequence logos [51] from an input multiple sequence alignment.

6. MANAGEMENT: this option launches a suite of scripts to manage files and folders in various ways. For instance, users can join, split and rearrange files, folders and contents. In addition, users can execute specific data mining searches in these files and folders, and export the results to new files and folders.

7. PREFERENCES: for selecting the user's preferences with regard to diverse issues (for example, FTP and pipeline connection).

8. HELP: tab for accessing the technical support service, the user guide and the corporate information of GPRO.

The following is an extended description of each menu command and its distinct utilities.

4.1. Databases

This command implements five utilities:

SELECT DIRECTORY: to select a folder as a directory or change the directory by selecting another folder.

SINGLE FASTA FILE: to open sequence and database files from any location on your PC. For simplicity, GPRO only works with files in FASTA format.

GENBANK ACCESSION: to edit GenBank files from your PC or to retrieve them online from GenBank at the National Center for Biotechnology Information (NCBI [52]).

OPEN WORKSHEET: to open annotation worksheets (for more details concerning worksheets, see Section 5). For simplicity, the worksheet only opens files in CSV format (plain files with comma-separated values).

NEW WORKSHEET: to create a new empty worksheet.

4.2. Directory

Users can show or hide the Directory within the GPRO interface (it appears to the left of the main desktop) using this tab.

4.3. Editing

GPRO provides two sequence editing programs in the Editor command, the third icon on the main menu. The first is a sequence editor aimed at the molecular analysis of nucleotide and amino acid sequences; the second is a database (plain text) editor designed for editing and manipulating database files (plain files in FASTA format).

4.3.1. TIME SEQUENCE EDITOR

This tool is an implementation of the previously introduced TIME editor [53]. In particular, TIME permits the editing, display and analysis of sequences of up to 2 x 10^9 residues (two gigabases or amino acids), which will suffice for the largest chromosomes known to date [54]. TIME functions are organized in a menu.

Layout: The TIME editor is launched in the main desktop of GPRO. It is divided into six sections (Figure 4).

Figure 4. Screenshot of TIME using the genome sequence of HTLV-1 [55], available in databases including GenBank [5] and EMBL [56], as an example. The window is divided into six regions as described in the text.

1. Hybrid menu bar and tool bar for easy access to all of the main commands and functions.
2. Tab list of all open sequences and files, allowing simultaneous editing of many sequences.
3. Main area: divided into two parts, described in the following points.
4. Sequence frame: the currently edited sequence is shown.
5. Results export toolbar: reports of each search operation can be saved in multiple formats.
6. Results table: a complete list of results from the last search operation.

TIME menu: TIME has its own commands and utilities, organized in a horizontal menu (Figure 5) and explained below following the numbering provided in the figure.

Figure 5. TIME menu

File: Clicking on this button creates, opens, saves and closes amino acid and nucleotide sequence files (only in FASTA format), and allows the user to quit TIME. If the chosen file is a database containing multiple sequences, TIME will open a tab with a summary of all available sequences. Double-clicking on a sequence listed in the summary opens it for editing in a new window. Note that TIME can edit as many sequences as there are FASTA entries in a database file. In addition, by clicking on either "New nucleotide sequence" or "New peptide sequence", users can create new sequences. Clicking on the sequence frame then changes the cursor to a blinking caret, allowing users to type a new sequence. Alternatively, users can paste the sequence text into an empty frame.

Edit: This tab contains the cut, copy and paste commands. These can be used between sequences to remove, copy and insert fragments. Users can undo and redo each individual change using the Undo and Redo commands. Sequences are locked by default when they are opened, which means that no changes are allowed until you unlock them.

Translate: To translate a nucleotide sequence into the corresponding amino acid sequence. Double-stranded sequences can be translated in all six reading frames, whereas only the three forward frames are available for single-stranded sequences. You can choose to view or hide frames one by one, or all available reading frames at once using the top checkbox (Figure 6). By default, TIME uses the standard genetic code [57], but users can edit, load and save a code of their own if they wish to employ non-canonical codes (Figure 6, to the right and below). This is useful, for example, when working with mitochondrial codes [58,59] or to specify alternative start codons, as is the case with some bacterial species [60-62]. Clicking on Edit beside Custom genetic code will take you to the genetic code editor. In addition to editing the translation codons, users can rename the code, save it to a file or open a previously saved custom code.
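The reading-frame behaviour described above (six frames for double-stranded sequences, three forward frames for single-stranded ones) can be sketched in Python as follows. This is an illustration only, not TIME's implementation: the function names, the frame labels "+1" to "-3" and the use of "X" for codons that cannot be translated are assumptions made for the example. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse-complementary strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, code=STANDARD_CODE):
    """Translate one reading frame; codons missing from the table become 'X'."""
    dna = dna.upper().replace("U", "T")
    return "".join(
        code.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna, double_stranded=True, code=STANDARD_CODE):
    """Translate the three forward frames, plus three reverse if double-stranded."""
    dna = dna.upper().replace("U", "T")
    frames = {f"+{k + 1}": translate(dna[k:], code) for k in range(3)}
    if double_stranded:
        rc = reverse_complement(dna)
        frames.update({f"-{k + 1}": translate(rc[k:], code) for k in range(3)})
    return frames
```

Passing a different dictionary as `code` plays the role of a custom genetic code, for example a mitochondrial table.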

Figure 6. TIME editor screenshot demonstrating an amino acid translation dialog and a pop-up to modify the default genetic code.

The translate utility of TIME can also open Gene Runner's translation table format (.trt files) and a native plain text format, which can easily be created in the editor of your choice. An example of this genetic code format is given in Table 1. Lines starting with the hash symbol (#) are interpreted as comments and ignored, with the exception of the first line, which holds the name of the code; however, this line is not mandatory. Each following line is formed by a codon (RNA and DNA are allowed), a hyphen and a greater-than symbol (->), followed by an amino acid symbol (according to the 1-letter IUPAC codes [63]). Start and stop codons are marked by the words start and stop after the amino acid. The default colors of the start and stop codons are blue and red, respectively, but they may be changed with the color palette after clicking on each of the colored buttons.

Table 1. Standard genetic code examples

    GAG -> E
    # Stop codon
    UAA -> * stop
    CUG -> L
    AGU -> S
    GCA -> A
    CCU -> P
    CAG -> Q
    # Stop codon
    UGA -> * stop
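To make the native plain text format concrete, here is a minimal Python sketch of a reader that follows the rules stated above. It is not TIME's parser: the function name `parse_genetic_code`, the returned tuple and the choice to normalize RNA codons to DNA (U to T) are assumptions made for the example.

```python
def parse_genetic_code(text):
    """Parse a genetic code in the plain text format described in the guide.

    Returns (name, table, starts, stops), where table maps codons to 1-letter
    amino acid symbols. Codons are normalized to DNA (an assumption here).
    """
    name = None
    table, starts, stops = {}, set(), set()
    for index, raw in enumerate(text.splitlines()):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A comment on the very first line holds the (optional) code name.
            if index == 0:
                name = line.lstrip("#").strip()
            continue
        codon, _, rest = line.partition("->")
        fields = rest.split()
        if not fields:
            raise ValueError(f"malformed line {index + 1}: {raw!r}")
        codon = codon.strip().upper().replace("U", "T")
        table[codon] = fields[0]
        # An optional trailing word marks start and stop codons.
        flag = fields[1].lower() if len(fields) > 1 else ""
        if flag == "start":
            starts.add(codon)
        elif flag == "stop":
            stops.add(codon)
    return name, table, starts, stops
```

The resulting `table` can be used wherever a codon-to-amino-acid mapping is needed, and `starts` and `stops` identify the codons that would be highlighted.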

Tables 2 and 3 list all the nucleotide and amino acid symbols recognized by the TIME editor (corresponding to the 1-letter IUPAC codes); ambiguous symbols are included.

Table 2. Valid nucleotide symbols

    Symbol      Nucleotide base
    A           Adenine
    C           Cytosine
    G           Guanine
    T/U         Thymine/Uracil
    R           Guanine or adenine (Purine)
    Y           Thymine/Uracil or cytosine (Pyrimidine)
    K           Guanine or thymine/uracil (Keto)
    M           Adenine or cytosine (Amino)
    S           Guanine or cytosine (Strong bonds)
    W           Adenine or thymine/uracil (Weak bonds)
    B           Guanine, thymine/uracil or cytosine (All but adenine)
    V           Guanine, cytosine or adenine (All but thymine/uracil)
    D           Guanine, adenine or thymine/uracil (All but cytosine)
    H           Adenine, cytosine or thymine/uracil (All but guanine)
    N           Any base
    - (hyphen)  Gap

Table 3. Valid amino acid symbols

    Symbol        Amino acid
    A             Alanine
    R             Arginine
    N             Asparagine
    D             Aspartic acid
    C             Cysteine
    E             Glutamic acid
    Q             Glutamine
    G             Glycine
    H             Histidine
    I             Isoleucine
    L             Leucine
    K             Lysine
    M             Methionine
    F             Phenylalanine
    O             Pyrrolysine
    P             Proline
    S             Serine
    T             Threonine
    U             Selenocysteine
    V             Valine
    W             Tryptophan
    Y             Tyrosine
    B             Asparagine or aspartic acid
    Z             Glutamine or glutamic acid
    J             Leucine or isoleucine
    X             Any amino acid
    * (asterisk)  Stop codon mark
    - (hyphen)    Gap

Orientation and Geometry: Allows the user to change from DNA to RNA and vice versa, and to view either RNA or DNA sequences as a single strand or a double strand. In the TIME editor, nucleotide sequences are displayed in the 5' to 3' direction. With the Orientation command you can switch to the reverse, complementary or reverse-complementary sequence. Geometry is used to switch to the transcript RNA of your sequence and to view RNA or DNA as a single or double strand.

Find ORFs: With this command you can search for and retrieve ORFs in forward and reverse frames; the command can check both orientations of the translated sequence. You can optionally specify a minimum ORF length in the dialog window (Figure 7). To search all frames without a length restriction, click the OK button without entering a number. ORF reports and their coordinates are summarized in the Results table of the editor (Figure 7). Double-clicking on a row of the Results table highlights the selected ORF in the editor. Reported ORFs can be exported as a FASTA database file or as an annotation CSV file.

Figure 7. Find ORFs rationale. This allows the user to specify (or not) an ORF minimum length. ORF reports and their coordinates are summarized in the Results table of the editor (to the right). To locate a reported ORF in the sequence, double-click on the appropriate row of the Results table and the selected ORF will be highlighted in the editor.

Find motifs: Users can search one or more edited sequences to retrieve single pattern and motif occurrences, and clusters of motifs. Clicking on Find motifs opens a dialog for performing quick, single motif searches (Figure 8, to the left). Amino acid and nucleotide motif searches can be performed. Type or paste the motif and select Find Motif. The results are presented in a new tab in the results frame. The Find Motif dialog remains open so that new searches can be carried out in quick succession.

Press the Multiple Motif Editor button to carry out an advanced search; a new dialog window opens (Figure 8). Users can use the buttons located under the table to add and remove motifs, or to load them from a FASTA file. At the top right of this dialog window the user can choose one of two main search modes: Single occurrences, where motifs are searched independently, or Clustered motifs (if you are interested in motifs grouped in clusters). The latter takes two parameters: a minimum length for the cluster and the minimum number of motif hits required to report a cluster. Figure 9 presents an example of the results obtained by searching both ends of a C/D box [64] in clusters. Cluster locations are specified on either strand, with DNA + for the forward direction and DNA - for the reverse direction. The user can specify whether the end and beginning regions of two consecutive clusters may overlap.

Figure 8. To the left and above, the find motif dialog. By clicking on the tab to the right and below (multiple motif editor), the tool opens a new dialog (image to the right) for adding motifs and performing searches in two ways: as single occurrences or as clusters of motifs (two or more motifs in array).

For users working in the field of genetics, the location of specific restriction sites within a DNA molecule is central. The multiple motif editor allows specific restriction enzyme sites to be searched by typing the recognition nucleotide sequence of the enzyme or, in the case of a large number of enzymes, by loading a FASTA file list (enzyme restriction sites can be downloaded from the Rebase web site [65]). Find motifs reports and summarizes the results in the Results table, as Find ORFs does.
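A single-occurrence motif search on both strands, such as a restriction site lookup, can be sketched in Python as follows. This is an illustrative sketch rather than the TIME algorithm: the function names, the 1-based forward-strand coordinates and the "+"/"-" strand labels are assumptions made for the example. The ambiguity symbols follow Table 2.

```python
import re

# IUPAC nucleotide symbols expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "V": "[ACG]",
    "D": "[AGT]", "H": "[ACT]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def motif_to_regex(motif):
    """Turn a motif with ambiguity symbols into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in motif.upper())


def find_motif(sequence, motif):
    """Return (strand, start, end) hits, 1-based on the forward strand."""
    seq = sequence.upper().replace("U", "T")
    # A lookahead makes overlapping occurrences visible to finditer.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    size = len(motif)
    hits = [("+", m.start() + 1, m.start() + size) for m in pattern.finditer(seq)]

    # Search the reverse-complementary strand, then map back to forward coordinates.
    rc = seq.translate(COMPLEMENT)[::-1]
    for m in pattern.finditer(rc):
        end = len(seq) - m.start()
        hits.append(("-", end - size + 1, end))
    return hits
```

A palindromic site such as GAATTC is reported once per strand at the same coordinates, which is the expected behaviour for restriction sites.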


Figure 9. TIME editor screenshot showing, to the right, the Results table obtained after searching both ends of a C/D box [64] in clusters.

4.3.2. DATABASE EDITOR

GPRO includes a database (text) editor for editing and manipulating sequence databases as plain files, or for taking notes. Basic sequence manipulations, such as copying/pasting sequences between documents or cutting sequence regions, can be done by positioning the blinking caret in the text box and right-clicking to choose the preferred option. When the database editor is invoked, a new, empty project sheet is opened by default. Protein and DNA/RNA sequences can be pasted and the file saved. In addition, any existing database file can be opened, edited and closed (the mouse can be used to drag a file from the directory to the editor). As shown in Figure 10, this editor offers an intuitive graphical interface, making it easy to manage. Users can write any kind of text, but it is preferable to follow the FASTA format as other GPRO functions only accept FASTA. Contents are displayed in the main desktop, while a list of the names of all sequences in the file can be viewed in the FASTA explorer to the right of the main desktop, which is used to control certain database management tasks.
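The relationship between a FASTA database and the list of names shown in the FASTA explorer can be sketched in Python as follows. The function names and the use of line numbers as the position of each entry are assumptions made for the example; they do not describe how GPRO stores or indexes databases.

```python
def fasta_labels(text):
    """List (label, line_number) for every FASTA header in a text database."""
    return [
        (line[1:].strip(), number)
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(">")
    ]


def read_fasta(text):
    """Map each FASTA label to its sequence, joining wrapped sequence lines."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            records[label] = []
        elif label is not None:
            records[label].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

The line number returned with each label is the kind of information an editor needs in order to jump to the selected sequence.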

Figure 10. Database editor screenshot using the CORES database distributed by GyDB as an example. (1) The main desktop shows all sequences of the database file being edited. (2) To the right, the editor provides a window called Fasta Explorer, which lists all FASTA labels in the database. Clicking on any of these labels in the Fasta Explorer moves the editor to the selected sequence. The Fasta Explorer has other utilities related to data-mining functions and to the association of database files with the worksheet management system of GPRO (these utilities are discussed below).

The utilities of the database editor are organized into a menu bar at the top of the editor, with four commands (Figure 11).

Figure 11. Database editor menu.

File: To open, save and close database files. Alternatively, users can open a file in the database editor by double-clicking it (or right-clicking it and selecting the option) within the Directory.

Search and replace: The content of large database files is usually complex in terminology, and users increasingly need a tool for managing it. For databases containing wrongly annotated sequences, or sequences whose names need to be edited, the database editor provides "Search" and "Search and replace" utilities (Figure 12). In both cases, the editor offers three options ("Exact term", "Case sensitive" and "Regular expression"). The first searches the names of the sequences for the exact term entered. The second (case sensitive) distinguishes uppercase from lowercase letters in the search. The third (regular expression) treats the search text as a pattern of characters or words written in a formal language [66], so that users can identify any text matching the specification provided.

Figure 12. Search and Replace dialog windows.

Export and Remove: To export and/or remove selected sequences from a database file to another file, which can be a new file created in the Directory or an existing file to which the user wishes to add sequences. Clicking on Export and Remove opens a new window (Figure 13) with a summary of all sequences. Sequences can be selected according to the presence of certain terms in the database: enter a key term in the Search text box and choose one of the following Filter options: "Exact match", "Case sensitive" and "Regular expression". A fourth option, Append selections, performs a new selection using a different key term and adds the result to the previous selection. If the tool finds any sequence labeled according to the search, it is highlighted in the summary. Finally, the tool allows the selected sequences to be exported or removed from the database file.

Figure 13. Export and remove dialog. The selected sequences (searched by the term MOV) that will be exported to an output file are highlighted in blue.

Find Motifs: To search for nucleotide and amino acid sequence patterns (motifs) in a database file by parsing all sequences in the file, using the exact match, case sensitive or regular expression options as filters (Figure 14). The number of motifs found is shown in a new dialog window, while matched sequences are highlighted in yellow in the Fasta Explorer.

Figure 14. Find motifs.

Found motifs can be exported to a new file. Alternatively, you can navigate the file sequence by sequence to inspect and retrieve the motifs, which remain selected in the editor desktop. In this task, you can use the Fasta Explorer as a shortcut to access the sequences of interest. Sequences displaying the selected motif remain highlighted in yellow within the Fasta Explorer.

4.4. Omic analyses

BLAST and HMM homology searches have been predominantly used to map the function of newly described sequences, either nucleotide or amino acid. In practice, the best hit of a BLAST search is often sufficient to establish a relationship of homology between query and subject. The quality of annotation depends on the quality of the database used as the subject in the homology search. GPRO employs a pipeline hosted at GyDB for running BLAST and HMM searches and for processing their outputs. To run these analyses you must be correctly connected to the Pipeline (for more details see Section 4.7.1). The general procedure for working with these functions is simple: drag your files with the mouse from the FTP Desktop to the appropriate text box of the server you want to use. The pipeline interface and logical workflow are organized into distinct sections, as described in the following text.

FORMAT DATABASES: To create and format BLAST databases. The steps to follow are described in Figure 15.

Figure 15. Procedure for giving a database the BLAST format. (1) Upload the nucleotide or amino acid database file (provided in FASTA) from the Directory to the FTP using the mouse. (2) Drag the FASTA file you want to format and (3) an output folder (a new folder or an existing one) from the FTP to the "Format databases" interface. (4) Name the database and select the type of database (nucleotide or amino acid). (5) Press Compile database. (6) The tool will automatically generate three binary files (.phr, .pin, .psq) within the output folder. These three files constitute the formatted database recognized by the BLAST package compiled in GPRO.
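After downloading a formatted database from the FTP, one may want to confirm that the three binary files are all present. The sketch below is a hypothetical helper written for this purpose, not part of GPRO. The protein extensions are those named in Figure 15; the nucleotide extensions (.nhr, .nin, .nsq) are the usual BLAST counterparts and are an addition not stated in this guide. Very large databases may be split into numbered volumes, which this sketch does not handle.

```python
from pathlib import Path

# Extensions of a formatted BLAST database, by database type.
# Protein extensions come from the guide; nucleotide ones are the
# standard BLAST equivalents (assumption for this example).
DB_EXTENSIONS = {
    "protein": (".phr", ".pin", ".psq"),
    "nucleotide": (".nhr", ".nin", ".nsq"),
}


def missing_db_files(folder, name, db_type):
    """Return the expected database file names that are absent from `folder`."""
    folder = Path(folder)
    return [
        name + ext
        for ext in DB_EXTENSIONS[db_type]
        if not (folder / (name + ext)).exists()
    ]
```

An empty list means that the folder holds a complete set of files for the database called `name`.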

The official web site of GPRO includes an extensive list of URLs for retrieving and downloading RefSeq databases and genome projects from the most representative scientific repositories and institutions. To our knowledge, this list references all publicly available resources, but the choice of subject database depends on your research objective and your working preferences. These databases do not belong to us. It is therefore important to read their privacy policies and to cite them together with GPRO when publishing.

For searching and mapping virus and MGE candidates, GPRO offers by default the RefSeq databases maintained and distributed by the GyDB project as subjects. Users do not need to format these databases as they are precompiled. However, GyDB databases only include sequences classified in the latest GyDB release. Therefore, if a user is characterizing the mobilome (the complete set of MGEs) of a genome, or MGEs not yet classified at GyDB, it is recommended to complement the search with other databases (see the URLs available in the official web site of GPRO).

BLAST ANALYSES: The pipeline compiles the latest release of the NCBI-BLAST package [33]. To run a BLAST search, the candidate genes (or proteins) should be collected in a FASTA database file. Upload the file to your pipeline account via the FTP in the same way as for formatting database files, described in the previous section, and use it as a query against a subject database via the "BLAST analyses" dialog (Figure 16). Depending on the "query-subject" comparison, the dialog provides four BLAST program options to run the search, as detailed in Table 4.

Table 4. Search programs implemented in the NCBI-BLAST package

Program   Description
BLASTP    Identify protein queries or find protein sequence homologs in a protein database
BLASTX    Find similar proteins to translated DNA queries in a protein database
BLASTN    Identify DNA queries or find DNA sequences similar to the queries
TBLASTN   Find similar sequences to protein queries in a nucleotide database

Searches based on a query file containing many candidate sequences are usually computationally intensive. GPRO therefore runs search jobs in remote unattended mode: the user can launch the search and it keeps running even after quitting the pipeline and closing GPRO. The BLAST dialog includes a text box for entering your e-mail address to receive a notification when the job is complete.
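The choice among the four programs follows directly from the type of the query and of the subject database. The sketch below encodes that table as a lookup. The helper that guesses the sequence type from its letters, and its 0.9 threshold, are assumptions for illustration only; GPRO asks the user to select the program explicitly.

```python
# (query type, subject type) -> BLAST program
BLAST_PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "protein"): "BLASTX",
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "nucleotide"): "TBLASTN",
}

NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.9):
    """Heuristic (assumption): call a sequence nucleotide when at least
    `threshold` of its letters are A, C, G, T, U or N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


def choose_blast_program(query_type, subject_type):
    """Pick the program that matches the query-subject comparison."""
    return BLAST_PROGRAMS[(query_type, subject_type)]
```

For example, a DNA query searched against a protein database maps to BLASTX, because the query must be translated before comparison.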

Figure 16. Procedure for launching a BLAST search, using as an example a query file MOV.txt containing distinct plant virus movement protein sequences. (1) Upload the FASTA file from the Directory to the FTP. (2) Drag the query file from the FTP to the "BLAST analyses" dialog using the mouse. Select the subject database to be searched. (3) Drag the folder containing your previously formatted databases and (4) the output folder (a new folder or an existing one) from the FTP to the BLAST analyses dialog. The database folder dialog will list the databases available; select the one of interest. Note that if you are searching for MGEs, you can use the RefSeq databases distributed by GyDB. (5) Select a BLAST program (see Table 4) and options (such as the E-value cut-off) to filter your search. (6) Enter your e-mail address to receive a notification when the job is complete. (7) Run the analysis. You can wait until a finish message appears in the resume log text box or close this window. The pipeline jobs run in background mode even if your computer is turned off. (8) When the job is complete, the pipeline will report a number of XML output files with your results in the output folder. (9) An idealized depiction of a BLAST result. For processing the information in the XML files, see the section on processing/retrieving results from BLAST and HMM outputs.

4.4.3. HMM ANALYSES: The pipeline compiles the latest HMMER release [34] for performing HMM searches with a specific HMMER program (Table 5). HMM databases used as a comparative subject do not require formatting. The user uploads a query file and a subject database to the FTP and places them in two different folders to launch the analysis, as described in Figure 17.

Table 5. Search programs implemented in HMMER

Program     Comparison
HMMPFAM     FASTA database file queries to an HMM profile database
HMMSEARCH   HMM profiles to a FASTA sequence database

Figure 17. Procedure for launching an HMM analysis, using as an example a query file MOV.txt containing distinct plant virus movement protein sequences. (1) Upload the FASTA file from the Directory to the FTP. (2) Drag the query file from the FTP to the "HMM analyses" dialog with the mouse. (3) Select a program option (see Table 5 above); in the example an HMM folder has been dropped from the FTP into the corresponding HMMPFAM dialog. Alternatively, users can import a precompiled database by clicking on the available link. (4) Choose the output folder (a new folder or an existing one) and type an E-value cut-off to filter your search. (5) Enter your e-mail address to receive a notification when the job concludes (the analysis can be computationally intensive and you may quit GPRO before the job finishes). (6) Run the HMM search. You can wait until a finish message appears in the resume log text box or close this window. The pipeline job runs in background mode even if your computer is turned off. (7) When the job is complete, the HMM program will report a number of TXT output files with your results in the output folder. (8) An example of an HMM analysis result. For processing the information in the TXT files, see the section on processing/retrieving results from BLAST and HMM outputs.

There are a number of HMM databases that can be used as a subject. URLs from which HMM databases can be downloaded from public repositories are listed in the official web site of GPRO. The choice of a database depends on the user's objectives and preferences. By default, the pipeline facilitates direct comparison to the GyDB-HMM or the CORES databases distributed by GyDB (these are precompiled). The GyDB-HMM database is a collection of HMMs constructed on the basis of the protein domains and MGE or virus lineages classified at GyDB. The CORES database is the RefSeq sequence database of all protein domains of these MGEs and viruses. In the first case, the user can use a sequence database file as the query and the HMM database as the subject, performing the search with the HMMPFAM program. In the second case, users can compare their own HMMs against the CORES database by performing the search with the HMMSEARCH program.

PROCESSING/RETRIEVING RESULTS FROM BLAST AND HMM OUTPUTS: Once the BLAST or HMM search is complete, the pipeline deposits the results in XML files (in the case of BLAST) and/or plain text files (in the case of the HMM search) within your output folder (see the BLAST ANALYSES section and Section 4.4.3). These files must be processed to export the information obtained in the previous analyses. The pipeline commands "Process BLAST outputs" and "Process HMM outputs" perform these tasks, as demonstrated in Figures 18 and 19, respectively.

Process BLAST outputs: There are three options for processing the output XML files generated by BLAST (Figure 18). The first retrieves the mapping results from the output files and exports them to a CSV file, which can be opened as a worksheet in GPRO. Before running the script, users can apply a similarity filter by giving an e-value cut-off as the significance threshold; the CSV result file will then contain only those sequences whose hits pass this threshold. The second option generates an additional FASTA database file in which the FASTA headers of the query sequences are labeled with the mapping results provided by the best BLAST hit. Similarly, the third option generates an additional FASTA database file with the subject sequences labeled with the information available in the FASTA headers of the queries. In both cases, there are three export options: first, export the full-length query or subject sequences to the new database file; second, export only the sequence cores that are similar in the analysis; third, export the core extended by as many nucleotides upstream/downstream, or residues towards the N-terminus/C-terminus, as desired.
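The first option (XML to CSV with an e-value filter) can be sketched with the Python standard library. The element names used below (`Iteration_query-def`, `Hit_def`, `Hsp_evalue`) belong to the NCBI BLAST XML output format; the three-column CSV layout, the function names and the tiny `SAMPLE` document are assumptions for illustration and do not reproduce the worksheet that GPRO generates.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Minimal stand-in for a BLAST XML output (illustrative data only).
SAMPLE = """<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_query-def>q1</Iteration_query-def><Iteration_hits>
<Hit><Hit_def>subjA</Hit_def><Hit_hsps><Hsp><Hsp_evalue>1e-30</Hsp_evalue></Hsp></Hit_hsps></Hit>
<Hit><Hit_def>subjB</Hit_def><Hit_hsps><Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration>
<Iteration><Iteration_query-def>q2</Iteration_query-def><Iteration_hits>
<Hit><Hit_def>subjC</Hit_def><Hit_hsps><Hsp><Hsp_evalue>2.0</Hsp_evalue></Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""


def best_hits(xml_text, evalue_cutoff=1e-5):
    """Return (query, subject, evalue) for each query whose best HSP
    has an e-value at or below the cut-off."""
    rows = []
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        best = None
        for hit in iteration.iter("Hit"):
            subject = hit.findtext("Hit_def")
            for hsp in hit.iter("Hsp"):
                evalue = float(hsp.findtext("Hsp_evalue"))
                if best is None or evalue < best[2]:
                    best = (query, subject, evalue)
        if best is not None and best[2] <= evalue_cutoff:
            rows.append(best)
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["query", "subject", "evalue"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv(best_hits(SAMPLE))` keeps only the query `q1`, whose best hit passes the default cut-off, and drops `q2`.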

Figure 18. Processing mapping results from XML outputs reported by BLAST. (1) Drop the folder with the mapping XML files reported by the BLAST search from the FTP into the BLAST result dialog (this folder is now the input). (2) Drag an output folder, where the processed data extracted from the XMLs will be deposited, to the Output folder dialog of the interface. (3) Select one of the options (A, B or C in the figure) for processing the XML files and give an e-value cut-off to set the significance threshold. Then choose the number of best hits you want and, if appropriate, select the option for filtering redundant matches by position. Note that if you select option A, "create a CSV with mapping data", the pipeline will report a CSV with all the information extracted from the XMLs. If you choose option B or C you will have access to (4) an additional dialog that appears to the left, which lets you additionally create a new FASTA database labeled with mapping information. (5) Click on Run to start the data processing.

Option B allows you to re-label the FASTA headers of all sequences contained in the database employed as the query with the information retrieved from mapping (including detected homologs and function). To do this, drag the query file to the box provided in the FASTA retrieval options interface. There are two possibilities for retrieving the sequences: first, use the FASTA file you would normally use as the query; second, if the query file is very large and is also available in BLAST format in your FTP site, use the option "use a compiled database". In addition, you can tell GPRO whether to export and re-label the full sequences or only the core of these sequences that detected similarities in the subject database. For instance, if you are using retroelement pol polyproteins to detect and retrieve the reverse transcriptase domains, you can automatically separate this domain for all sequences of your query database and store it in a new database. In this task you can also ask GPRO to retrieve stretches of nucleotides (in the case of DNA sequences) or residues (in the case of protein sequences) flanking the core of your sequences by specifying the number of nucleotides or residues you want to add.

Option C works in an identical way, but instead of using the query file it lets you retrieve and export the homologs detected in the subject database, re-labeled with the additional information of your queries.
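The "core plus flanks" export can be pictured as a simple slice of the sequence. The sketch below assumes 1-based, inclusive hit coordinates (the convention BLAST uses for HSP positions); the function names and the header layout produced by `relabel` are hypothetical and are not GPRO's output format.

```python
def extract_core(sequence, start, end, flank=0):
    """Return the matched core of `sequence`, optionally extended.

    `start` and `end` are 1-based inclusive coordinates of the hit.
    `flank` is the number of extra nucleotides or residues to keep on
    each side; the result is clipped at the sequence ends.
    """
    left = max(start - 1 - flank, 0)
    right = min(end + flank, len(sequence))
    return sequence[left:right]


def relabel(header, best_hit_description):
    """Append mapping information to a FASTA header (hypothetical layout)."""
    return header + " | " + best_hit_description
```

With `flank=0` only the similar region is kept, which corresponds to exporting the core alone; a large `flank` simply returns the full sequence.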

Process HMM outputs: This utility retrieves the mapping results from the output files generated by HMMER and exports them to a CSV file that can be opened as a worksheet (Figure 19). Before running the script, users can apply a similarity filter by giving an e-value cut-off as the significance threshold.

Figure 19. Procedure for processing mapping results from HMM outputs. (1) Drop the folder where you placed the mapping results (these are TXT files) from the FTP into the HMM result dialog (this is now the input folder). (2) Select another folder as the output for the processed results and drop it into the Output folder dialog of the interface. (3) Assign an e-value cut-off to set the significance threshold. Then choose the number of best hits and, if appropriate, select the option for filtering redundant matches by position. (4) Run the script. (5) The pipeline will report a CSV file with a detailed output of the results obtained in the search. An example CSV is shown at the bottom of the figure.

Alignment Analyses

Multiple alignments are central to analyzing the primary structure (the sequence) of genes and proteins and to assigning function and homologies to them [67]. Multiple alignments are useful for constructing HMM profiles, consensus sequences and sequence logos for the identification of conserved motifs that may be characteristic of genes and protein domains. GPRO 1.0 implements commands for constructing these computational tools, as described in the subsequent sections.

4.5.1. HMMs AND CONSENSUS SEQUENCES: HMM profiles [50] are probabilistic models that capture specific information about the sequence consensus of a set of aligned sequences. This utility allows the user to create and calibrate HMM profiles and majority-rule consensus (MRC) sequences from multiple alignment inputs, using an easy-to-use script that accesses the implementation of HMMER [34] installed in the GPRO pipeline. Figure 20 illustrates the steps to follow. Note that to run the pipeline you must be connected to the internet.

Figure 20. Creating HMM profiles and consensus sequences with the HMMER application compiled in the GPRO pipeline. As an example, a query folder containing several files was used. Each file contains a multiple alignment based on the protease variant typically encoded by the members of a particular lineage of LTR retroelements. In short, the example illustrates how to create lineage-specific HMMs based on proteases: (1) Drag the query folder from the FTP to the "Alignment folder" text box to create and calibrate HMMs based on the alignment files deposited in the folder (these must be provided in FASTA format). (2) You can create MRC sequences from the input alignments by checking the box called Additional options. (3) Choose an output folder (a new folder or an existing one) for the created HMMs or MRC sequences and drag it to the output folder box of the interface. (4) Enter your e-mail address to receive a notification when the job is complete (the analysis can be computationally intensive and you may quit GPRO before the job finishes). (5) Run the script. (6) and (7) show examples of created HMMs and MRC sequences, respectively.
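A majority-rule consensus can be computed column by column from an alignment. The sketch below is a generic illustration, not the procedure that HMMER or the GPRO pipeline applies: the strict 50% threshold, the `X` placeholder for columns without a majority, and the treatment of gaps as ordinary symbols are all assumptions.

```python
from collections import Counter


def majority_rule_consensus(sequences, threshold=0.5, ambiguous="X"):
    """Consensus of equally long aligned sequences.

    A column contributes its most frequent symbol when that symbol's
    frequency is strictly above `threshold`; otherwise `ambiguous`.
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*sequences):
        symbol, count = Counter(column).most_common(1)[0]
        consensus.append(symbol if count / len(column) > threshold else ambiguous)
    return "".join(consensus)
```

For the alignment `ACGT`, `ACGA`, `ATGC` the first three columns have a clear majority, while the last column has three different states and is written as `X`.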

SEQUENCE LOGOS: The sequence logo methodology was introduced by Schneider and Stephens [51] as a way to visualize alignment features, including the general consensus, the information content at each position, and the frequency of all possible nucleotide or amino acid states per alignment position of a set of sequences, using the Shannon algorithm [68] (for more details about the logo methodology, see [51,69]). GPRO includes an implementation of CheckAlign [70,71], a logo-maker application that creates logo representations from gapped and ungapped alignments in FASTA format using information theory. There are options for applying corrections to alignments with a small number of aligned sequences, and logo approximations for alignments based on poorly conserved sequences (Figure 21; more details about CheckAlign are available in [70,71]).

Figure 21. Screenshot of the logo maker tool, based on the implementation of CheckAlign, a tool for creating sequence logos. As an example we used an alignment of the Aphid Transmission Factor (ATF) product normally encoded by plant caulimoviruses. The alignment was retrieved from the GyDB collection [72]. The procedure is rather intuitive: upload the alignment (only in FASTA format), select a method and run the analysis.
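The quantity plotted in a sequence logo is the information content of each alignment column, computed as the maximum possible entropy minus the observed Shannon entropy. The sketch below shows that calculation in its simplest form. It omits the small-sample correction mentioned above and ignores gap characters, both of which are simplifications made for this example; it is not CheckAlign's code.

```python
import math
from collections import Counter


def column_information(column, alphabet_size):
    """Information content (bits) of one column: log2(s) - H.

    `alphabet_size` is 4 for nucleotides and 20 for amino acids.
    Gaps ("-") are ignored; an all-gap column scores 0.
    """
    symbols = [s for s in column.upper() if s != "-"]
    if not symbols:
        return 0.0
    total = len(symbols)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(symbols).values()
    )
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size):
    """Height of each letter in the logo: its frequency times the
    information content of the column."""
    symbols = [s for s in column.upper() if s != "-"]
    info = column_information(column, alphabet_size)
    return {s: info * n / len(symbols) for s, n in Counter(symbols).items()}
```

A fully conserved nucleotide column reaches the maximum of 2 bits, while a column in which the four nucleotides are equally frequent carries no information and would appear flat in the logo.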

4.6. Management

This command offers utilities for managing files, folders and contents placed in the Directory. These are useful for rearranging or creating databases, or for performing data mining. When any "Management" utility is invoked, a window interface appears in the main desktop, while the Directory appears to its left so that files and folders can be dragged with the mouse to the available utilities, which are organized as follows.

FILES AND FOLDERS: This utility is organized in three sections called "Join Folders", "Join Files" and "Split Files".

Join Folders: To reorganize folders using key terms (as diverse as an enzyme name or an annotator name) as criteria to search for sub-folders labeled with those terms anywhere in the Directory and to gather them in a new single folder (Figure 22).

Figure 22. Join folders. The utility "Files and Folders" implemented in the command Management is divided into three sections (Join folders, Join files and Split files) available at the top of the interface. If you want to reorganize folders (even if they are inside other folders within the Directory), use "Join folders" as follows: (1) Use the mouse to drag the folder containing all the relevant folders from the Directory to the "input folder" text box of the join folder interface. (2) Type as many word filters as required to reorganize the contents of the directory in folders labeled according to the terms used as word filters. (3) Create an output folder in the Directory and use the mouse to drop it in the output folder section of the interface. (4) Run the script. (5) The tool exports all folders matching your search terms to the output. In the example, our aim was to reorganize folders with contents concerning retroelement protein domains into new folders labeled according to the acronym of each protein domain (Gag, PR, INT, etc.).

Join Files: To select and group files in a single file or folder. The tool accepts only FASTA and XML files. Using Join files, users can collect, for example, all the XML files of a BLAST analysis in a single XML file, or retrieve common gene features of an annotation database divided into several files. The possibilities offered by this tool are the following:

- Select files and place them in a single folder
- Select XML files and place them in a single folder
- Join files in a single file
- Join XML files in a single XML file

In all cases, it is possible to filter the results using criteria such as file name, extension and type of sequence (nucleotide or protein). The tool works even if the files are in different sub-folders of the selected folder from which you want to retrieve the files. The procedure is similar to that of "Join folders", but in this case the tool manages files instead of folders. Figure 23 provides a screenshot of the Join files process.

Figure 23. Join Files. This is the second option of "Files and Folders". The utility Join files allows users to reorganize files within the Directory (even if they are in different sub-folders of the root folder). The tool manages only FASTA database files and XML files, and permits the use of key terms as including or excluding filters, either to export all files matching a term into a single folder or to create a single database file encompassing the contents of all files matching the filter. To illustrate these two utilities, a variety of LTR retroelement sequences annotated from the pea aphid Acyrthosiphon pisum genome were used [73]. These annotations are divided into folders (one for each annotated LTR retroelement), each containing two files: one with the nucleotide sequence of the LTR retroelement, and the other with the sequences of its encoded products. In the first example, the protein and nucleotide sequence files of all annotation folders were separated and placed in two folders (one for proteins and the other for nucleotides). To do this: (1) Drag the input folder whose contents you want to retrieve and reorganize (it can be any folder within the Directory) from the Directory to the "Input folder" box. (2) Check the option "Select files and place them in a single folder" and apply the filter options. (3) Drop the output folder (a new folder or an existing one) and (4) run the script. Alternatively, if you select the option "Join files in a single file", the tool will collect the contents of all files and group them in a single database file. In other words, this second option is an easy way to create DNA and protein databases. (5) The results stored in the output folder according to the filter: in the example, the nucleotide sequences have been gathered in a single folder (A) and joined in a single file (B). The utility only works with FASTA and XML files.

4.6.1.3. Split Files: To split a database file into as many files as there are selected criteria. Sequences can be selected according to one or more key terms in their names, or by blocks of sequences (to divide the content of a large database file into two or more files). Figure 24 demonstrates a Split Files example.

Figure 24. Split files. This is the third tool available at the top of the "Files and folders" interface within the command Management. You can use it for simultaneously exporting contents that share a reference feature from a database, or for dividing a database (if it is excessively large) into blocks of sequences (the number must be specified). To use Split files you must: (1) Drag your input file from the Directory into the corresponding text box in the split file interface. (2) Select one of the proposed split options. In this case we used the CORES database distributed by GyDB. This file contains protein sequences whose FASTA headers are labeled with an acronym describing the protein domain and with the name of its encoding retroelement. We want to extract all aphid transmission factors (label ATF) and all virus-associated proteins (label VAP) simultaneously and export them into two separate files. To do this, type ATF and VAP in the box to the left of the split option, (3) drag an output folder to the related text box and (4) run the script. (5) The tool then reports three files: two of them contain all sequences of the CORES database with the labels ATF and VAP, respectively, while the third file (no match) includes the remaining sequences of the processed database.
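Both split modes can be sketched on an in-memory FASTA file. The code below is an illustrative sketch rather than GPRO's script: the parser is deliberately minimal, terms are matched as plain substrings of the header, the `no_match` group name mirrors the third file in Figure 24, and interpreting "the number" as sequences per block is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_by_terms(records, terms):
    """Group records by the key terms found in their headers.

    Records matching none of the terms go to the "no_match" group.
    """
    groups = {term: [] for term in terms}
    groups["no_match"] = []
    for record in records:
        matched = [term for term in terms if term in record[0]]
        for term in matched:
            groups[term].append(record)
        if not matched:
            groups["no_match"].append(record)
    return groups


def split_in_blocks(records, block_size):
    """Divide the records into consecutive blocks of `block_size` sequences."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]
```

Each group or block can then be written to its own file, one `>header` line followed by the sequence per record.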

4.6.2. FIND SEQUENCES: This is a data-mining tool for searching and extracting sequences from FASTA database files, using the labels or names of the sequences in the FASTA header as the search criterion. With this utility you can search a database file using one or more term options (exact terms, case sensitive, regular expression, etc.) to export sequences and create new databases, as described in Figure 25.

Figure 25. Find sequences. Complementary to other management tools such as Split files, this utility is a conventional data-mining engine for retrieving sequences from a database file using terms as search criteria. In this task: (1) Drag the input file from the Directory to the "Input file" text box. The FASTA headers of all sequences within the file will be listed in the sequence list dialog. (2) Type the label or name of the sequences you want to extract and choose the filter options (exact term, case sensitive, etc.) in the "Select by term" box below. In this example we used a single search term (MOV), but you can use two or more terms by typing them separated by commas. (3) Drop an output file into the corresponding dialog. For information, any items in your file matching your search terms are automatically highlighted in blue within the "Sequence list" dialog (note that to the left of the sequence list there are additional tabs to select all sequences, to deselect them or to invert the selection). Then choose whether you want to overwrite the output file. If you do not select this option, the sequences will be added to the output file without removing its previous contents. (4) Click Run to retrieve and export your selection into a new database within the Directory. (5) The bottom of the figure shows the FASTA file exported in this example.
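The term options map naturally onto regular-expression matching. The sketch below shows one plausible reading of them. The guide does not specify GPRO's exact matching rules, so the semantics are assumptions: "exact" is taken to mean that the whole header equals the term, matching is case-insensitive unless requested otherwise, and several terms are combined with a logical OR.

```python
import re


def header_matches(header, term, exact=False, case_sensitive=False, regex=False):
    """Test one FASTA header against one search term."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Without the regex option the term is searched literally.
    pattern = term if regex else re.escape(term)
    if exact:
        return re.fullmatch(pattern, header, flags) is not None
    return re.search(pattern, header, flags) is not None


def find_sequences(records, terms, **options):
    """Select (header, sequence) records matching any comma-separated term."""
    wanted = [t.strip() for t in terms.split(",") if t.strip()]
    return [
        (header, seq)
        for header, seq in records
        if any(header_matches(header, t, **options) for t in wanted)
    ]
```

Because literal terms are escaped, a search for a name containing characters such as `.` or `(` behaves as plain text unless the regular expression option is enabled.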

ALIGNMENTS: This command presents two tools, one for joining DNA/RNA and protein multiple alignments and one for converting them between file formats, as described in the two sections below.

Join alignments: A frequent aim in multiple alignment methodology is to identify the set of motifs common to all sequences with the same known function. This is the most conserved part (core) of a DNA or protein domain. For instance, in the study of MGEs [74] it is usual to find elements harboring ORFs for polyproteins containing two or more protein domains. In these cases, researchers may be interested in evaluating the consensus phylogenetic signal common to these domains. To facilitate this, GPRO includes an implementation of "Join Alignments" [75]. This tool allows users to automatically join files containing distinct multiple alignments (one per domain) into a single alignment within a single file, arranged in a user-defined order (Figure 26). The tool has two requisites: the number and names of the sequences must be identical in each file to be joined, and the alignments must be provided in FASTA format.

Figure 26. Join alignments. This utility permits you to join files containing distinct multiple alignments (one per domain) into a single alignment within a single file, following any specified order. The tool has two important requisites: the number and names of the sequences must be the same in each file to be joined, and the alignments must be provided in FASTA format. To illustrate the utility we have used different files, each containing a multiple alignment based on the GAG (red), Protease (AP, green), Reverse transcriptase (RT, blue), Ribonuclease H (RNaseH, yellow), Integrase (INT, violet) and Envelope (ENV, orange) proteins encoded by distinct Retroviridae retroviruses. We will join these six alignments in a single gag-pol alignment, organized as described in the figure. To do this: (1) Drag the six alignment files to be joined from the Directory to the corresponding text box dialog. This dialog lists the files in the order in which the alignments will be joined. Files can be reordered with the "move up" and "move down" commands at the right, or removed from the list. (2) Drop an output file into the corresponding dialog and (3) run the script. (4) If the number and names of the sequences are identical, the tool will successfully join the sequences and report a single FASTA alignment with all gag-pol domains joined in the specified order for each common name.
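The joining step and its two requisites can be sketched as follows. This is an illustration of the idea, not the code of the Join Alignments tool [75]: each alignment is represented as a dictionary from sequence name to aligned sequence, and the function name and error messages are assumptions.

```python
def join_alignments(alignments):
    """Concatenate alignments by sequence name, in the order given.

    `alignments` is a list of dicts {name: aligned_sequence}. Every
    alignment must contain exactly the same names, and all sequences
    within one alignment must have the same length.
    """
    names = list(alignments[0])
    for alignment in alignments:
        if set(alignment) != set(names):
            raise ValueError("sequence names differ between alignments")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("sequences in an alignment differ in length")
    return {
        name: "".join(alignment[name] for alignment in alignments)
        for name in names
    }
```

Changing the order of the list changes the order of the domains in the joined alignment, which corresponds to the "move up" and "move down" commands of the dialog.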

Format alignments: This option is an implementation of the Alignment Format Converter [76], a tool that allows users to upload a protein or nucleotide multiple alignment file in one format and convert it into other formats in one step. The utility accepts and converts the following formats: FASTA, Clustal, Pir, MSF, Phylip and Stockholm. Any of these formats can be used for input, and the output can be produced in one, several or all formats simultaneously. Figure 27 presents a scheme of the procedure.

Figure 27. Procedure for converting an alignment file into several formats simultaneously, using an alignment of GAG polyprotein Retroviridae sequences as an example. (1) Drag the file to be processed from the Directory to the input text box. (2) Select the format of the input alignment and the format or formats you wish to convert it to (aln, msf, phy, pir, sto). (3) Drop an output folder into the corresponding dialog and (4) run the script. (5) The formatted files will be exported to the output folder.
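As an illustration of what such a conversion involves, the following Python sketch rewrites a FASTA alignment in Stockholm format. It is only a sketch under stated assumptions: the function name `fasta_to_stockholm` is hypothetical, only the text before the first space in each FASTA header is kept as the sequence name, and none of the optional Stockholm annotation lines are written.

```python
def fasta_to_stockholm(fasta_text):
    """Convert an aligned FASTA string into a minimal Stockholm string."""
    names, seqs = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the header text up to the first space.
            names.append(line[1:].split()[0])
            seqs.append("")
        elif seqs:
            seqs[-1] += line
    if not names:
        raise ValueError("no sequences found")
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    width = max(len(n) for n in names)
    lines = ["# STOCKHOLM 1.0"]
    lines += [f"{n.ljust(width)} {s}" for n, s in zip(names, seqs)]
    lines.append("//")
    return "\n".join(lines) + "\n"


stockholm = fasta_to_stockholm(">a\nAC-G\n>bb\nACTG\n")
print(stockholm)
```

Producing several output formats at once, as the dialog allows, would amount to calling one such writer per selected format on the same parsed alignment.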

This section allows default preference settings to be selected for the Pipeline connection and the default fonts for the worksheets.

4.7.1. PIPELINE CONNECTION SETTINGS: This is a dialog for configuring the Pipeline connection that gives users access to their remote accounts on the cluster (Figure 28). The user must provide the personal connection credentials given at registration of the GPRO application. Once all connection data have been entered, click on "Test connection settings" to check whether the settings are correct. If so, the FTP Desktop view will be accessible by clicking on the "Omic Analyses" tab (Section 4.4) in the main menu.

Figure 28. Pipeline connection settings.

WORKSHEET PREFERENCES: This is a dialog for configuring specific preferences for annotation worksheets (Figure 29). It allows the user to set default font settings to be applied to all annotation worksheets (for more details concerning the GPRO Worksheet see below).

Figure 29. Worksheet preferences dialog.

4.8. Help

The Help button provides legal information concerning GPRO, access to this manual and access to the "suggestions and bugs" tab implemented in the tool for reporting incidents and providing suggestions. This section allows us to receive users' feedback in order to implement new functions and commands adapted to the needs of GPRO users.

5. WORKSHEET ANNOTATION SYSTEM

GPRO allows users to manage the annotation of multiple gene candidates and other genomic features using a worksheet annotation management system based on a grid of editable cells arranged in numbered rows and columns (Figure 30). This template can open CSV files created by the user or obtained as the result of a previous analysis (such as "Process BLAST outputs"; Figure 30).

Figure 30. Worksheet screenshot. Example of an omic project stored in a CSV file and opened using the worksheet data management system of GPRO. (1) The worksheet menu bar and commands. (2) Column headers; in this example the order is: Sequence, Subject, Score, e-value, Query from, Query to, Hit from, Hit to, Identities, Positives and Function. (3) Grid of cells with numbered rows and columns, which can be selected by clicking on the corresponding checkboxes. In this example, rows display data from several sequences obtained from a previously annotated metagenome file. (4) Information concerning the availability of a link between the worksheet and a FASTA database.

The worksheet is launched in the main GPRO desktop and implements a wide variety of functions to: (a) create and remove rows and columns; (b) import, export and combine databases on the basis of a selection of rows and/or columns; (c) add, search and replace annotation terms based on commonly used taxonomies and vocabularies; (d) organize and color the cells according to key terms (mapping, annotation, function, statistics, etc.). In addition, the worksheet can be linked to a FASTA sequence database. When the user then changes something in the worksheet, the change is simultaneously applied to the database if the database includes the information modified in the worksheet.

Worksheets can be opened in a variety of ways: (a) as a new empty project or as an already existing project selected from the command "Databases" available in the main menu; (b) from a CSV file provided by the user (for instance, an Excel document saved as CSV) in the directory by using the mouse; (c) from the File command available in the worksheet menu (see Section 5.1). By default, a new worksheet implements the columns summarized in Table 6, but users can add or remove columns and rows by right-clicking with the mouse anywhere on the worksheet.

Table 6. Worksheet default columns

Column          Description
Sequence        Your sequence name.
Sequence Size   Length of the query sequence.
Subject         The database subject mapped using your sequence as query.
Subject Size    Length of the mapped subject.
Accession       The accession number or identifier of the mapped subject.
Species         Host species.
Score           Scoring for the alignment between query and subject (the HSP).
E-value         Statistics associated with the alignment between query and subject. The lower the E-value, the more significant the score, with the exception of the perfect hit (one sequence against itself), where BLAST usually assigns the E-value "0" by default.
Query-from      Query sequence start position between two matched sequences.
Query-to        Query sequence end position between two matched sequences.
Subject-from    Subject sequence start position between two matched sequences.
Subject-to      Subject sequence end position between two matched sequences.
Query frame     Frame of the query.
Subject frame   Frame of the subject.
Identities      Degree to which two (nucleotide or amino acid) sequences are invariant.
Positives       Number and fraction of residues for which the alignment scores have positive values.
Hsp/Query       Coverage between the HSP and the query.
Hsp/Hit         Coverage between the HSP and the subject.
Comments        To take notes about this sequence.

Users can move a column from one position to another by left-clicking on it and dragging it with the mouse (see Section 6.4). A description of the worksheet menu and its tools for managing an omic project is set out below.

5.1. File

This command provides a drop-down menu with five utilities allowing the user to open and save a worksheet project, to select a default file, to edit and to export a new worksheet (based on a user-defined selection or on previously checked rows and columns). A brief description of each utility follows.

OPEN WORKSHEET: To launch pre-existing CSV files as worksheets.

SAVE: To save the changes in an open worksheet within the same CSV.

SAVE AS: To save a worksheet as a CSV.

SET AS DEFAULT WORKSHEET: To fix a particular CSV file as the default worksheet. The selected CSV will then be opened automatically when launching GPRO.

EXPORT: To export a set of annotated candidate genes, the worksheet allows you to perform a variety of selections on the basis of distinct row/column criteria (function, host, E-value, ontology, etc.) described in the next sections of this manual. When you are ready to export results, click on the Export option in the File command of the worksheet and choose one of three possibilities: "Export CSV & FASTA", "Export Annotation file" and "Export categories & clusters".

Export CSV & FASTA: This option exports results in three ways: first, as a new worksheet; second, as a new worksheet accompanied by a FASTA database (with the FASTA headers of the sequences labeled according to annotation terms); third, as a FASTA database with the annotated sequences. Figure 31 presents a graphical description of these three possibilities. In the second and third cases, exporting a FASTA database annotation using the worksheet as a manager requires the sequence database you want to re-label to have been associated with the worksheet beforehand. To learn how to do this, see section 5.8 "Associate database".

Figure 31. Exporting data. You can use the worksheet to edit or modify information from omic projects stored in CSV files and their associated sequence databases (if there is a link between worksheet and FASTA database) by removing or adding rows and columns, editing contents or adding new information. To save the changes over the same worksheet, click on the options "Save" or "Save as" available in the File worksheet command. However, should you want to save modifications while preserving all the previous worksheet information in the original CSV, you can export your changes to a new CSV by selecting the path Export > Export CSV & FASTA within the File command. You have three exporting options: (A) the option "Worksheet & fasta" exports a worksheet coupled with a FASTA database if there is a previous worksheet-database association (to do this see section 5.8). (B) The option "Worksheet only" exports the edited worksheet to a new CSV. (C) The option "FASTA only" exports a FASTA database (there must be a previous association between a worksheet and a sequence database). In all cases, you must check the rows and columns to be exported (step indicated with hand icons).

Exporting annotations: This function exports annotated sequences (rows) from the worksheet using one or more columns as the reference for the annotation records and other columns as the record features. The exported output is a plain text file (named "Annotation" in all cases) with the annotation information usually organized in pairs of lines for each worksheet row (i.e. each annotated sequence), except for rows that share header information.

As shown in the example below, the first pair is the annotation header and provides information about the organization of the annotation in the file.

Reference=Query def|Function
Hit def|Hit accession|Score|e-value|Query from

The first line of the header indicates which columns (separated by bars) have been selected as the annotation references. In the example above, these are "Query def" and "Function". The second line sets the order assigned to the distinct columns (separated by bars) you select as the subject. In the example above, these columns are called "Hit def", "Hit accession", "Score", "e-value" and "Query from". The remaining pairs correspond to the distinct sequences annotated (one pair for each sequence) according to the header organization. Four examples of annotation follow.

Reference=contig00720gene_4 stage iii sporulation protein j precursor lin ,06E
Reference=contig00745gene_4 nitrate reductase beta chain SA
Reference=contig00667gene_92 general stress protein 13 SA ,34E
Reference=contig00667gene_91 peptidylprolyl isomerase SA ,09E

If some rows share the header (for instance, duplicated or related ORFs within the same contig or scaffold), they will be grouped into a cluster with as many lines as there are rows sharing the header. See the example below.

Reference=contig00667gene_89 monovalent cation h+ antiporter SA0813_ SA ,12E-54 1 SA ,02E-49 1 SA
Reference=contig00667gene_70 membrane protein SA ,67E SA SA ,95E
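The layout described above (a header pair, then one "Reference=" line per distinct header followed by one feature line per row sharing it) can be sketched in Python as follows. This is an illustrative reading of the format, not GPRO's exporter: the function name `export_annotation` is hypothetical, and the row values are invented placeholders rather than data from the manual.

```python
def export_annotation(rows, reference_cols, feature_cols, sep="|"):
    """Build annotation text: header pair, then grouped records."""
    lines = [
        "Reference=" + sep.join(reference_cols),  # header line 1
        sep.join(feature_cols),                   # header line 2
    ]
    groups = {}  # reference key -> list of feature lines, in row order
    for row in rows:
        key = sep.join(row[c] for c in reference_cols)
        features = sep.join(row[c] for c in feature_cols)
        groups.setdefault(key, []).append(features)
    for key, feature_lines in groups.items():
        lines.append("Reference=" + key)
        lines.extend(feature_lines)  # several lines when rows share a header
    return "\n".join(lines) + "\n"


# Invented placeholder rows; the first two share the same reference header.
rows = [
    {"Query def": "contigA_gene1", "Function": "membrane protein",
     "Hit def": "hitX", "Score": "120"},
    {"Query def": "contigA_gene1", "Function": "membrane protein",
     "Hit def": "hitY", "Score": "95"},
    {"Query def": "contigB_gene2", "Function": "nitrate reductase",
     "Hit def": "hitZ", "Score": "300"},
]
annotation_text = export_annotation(
    rows, ["Query def", "Function"], ["Hit def", "Score"]
)
print(annotation_text)
```

The `sep` parameter mirrors the field separator option of the export dialog, which defaults to a vertical bar.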

Figure 32 presents a graphical description of this process and the format with which the data are annotated in the output file.

Figure 32. Exporting annotations. If your omic project is complete and you want to export your annotation results, the option "Export annotation file" lets you prepare and export the annotation via the worksheet. Select the path Export > Export annotation file available within the File worksheet command. This action opens a new window in which to select the exporting preferences. The annotation format followed by GPRO is a summary of records organized into pairs of lines for each annotated row, except for rows sharing header information. The first line of the output file is the item header, created by selecting one or more reference columns to refer to each item in the annotation. To make the header selection, move any column you want to use as header (one, two or more) from the dialog list called "Reference columns" to the adjacent area using the transfer arrow between these two dialogs. You can then reorganize the order of the reference columns in the annotation header using the vertical arrows. Select the columns you want to export as annotation features and move them from the dialog list "Export columns" to its adjacent area. The procedure is identical to that for the reference columns, but in this case you have the additional option of joining the information from two columns into a single one. You can select the type of field separator used to separate columns in each annotation item (by default, a vertical bar). Finally, the program presents a preview of the exporting format at the bottom of the window. If this is correct, press OK to run the automatic annotation export.

Exporting rows in categories and clusters: This tool allows the user to export sequence rows as categories (one file per category) or clusters (sequence pools created on the basis of common features and exported to a single file). The selection is based on a term repeated in, or common to, distinct rows within the column you select as a reference (function, clades, host species, etc.). Figure 33 illustrates the process to be followed in order to export rows by categories.

Figure 33. Exporting rows in categories. This tool allows you to export sequence rows by categories in distinct files (one file per category) on the basis of common features in a column selected as a reference, as follows. (1) Check the rows (the output units) and columns (the information associated with each row) you want to export and follow the path Export > Categories & clusters available in the File command of the worksheet menu. A dialog window will appear. (2) Select a destination folder where the generated files will be stored and the worksheet column you want to search for common terms. In addition, a summary of the distinct worksheet columns is available for you to add or remove columns. (3) Check the exporting option you want to use (in this case "Categories") and press OK to run the utility. (4) The output folder "Export Categories" will contain all CSV files generated by this tool, divided into distinct files according to the recurrence of common terms in the searched column, here called "function". (5) An example of a generated file displaying the rows grouped by the same function category (framed in red in the input worksheet).

The process for exporting clusters of related rows within a single file is almost identical to that previously described in Figure 33 for exporting categories. Figure 34 illustrates the procedure. If the worksheet in use is associated with a database, the "Export categories & clusters" function will also provide the corresponding FASTA files for the sequences of the categories and clusters obtained.

Figure 34. Exporting rows in clusters. You can export sequence pools as clusters of rows created on the basis of common features in a column selected as a reference. The procedure is the same as that shown for exporting categories. (1) Check the rows and columns you want to export and follow the path File > Export > Categories & clusters in the worksheet menu. (2) Select a destination folder for the output file and the key column in the worksheet you want to search for common terms. (3) Check the exporting option you want to use (in this case "Clusters") and press OK to run the utility. (4) An example of a Clusters file where sequences with the same function were clustered (framed in red in both the Clusters and worksheet files).
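The difference between the two export modes can be summarized with a short Python sketch. The function names and sample rows are hypothetical, and the grouping rule (exact match of the cell value in the reference column) is an assumption; the manual does not specify how GPRO compares terms.

```python
def group_by_term(rows, key_column):
    """Group rows by the exact term found in the reference column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key_column], []).append(row)
    return groups


def export_categories(rows, key_column):
    """Categories: one separate output (file) per term."""
    return group_by_term(rows, key_column)


def export_clusters(rows, key_column):
    """Clusters: a single output where rows sharing a term are adjacent."""
    groups = group_by_term(rows, key_column)
    return [row for term_rows in groups.values() for row in term_rows]


# Invented sample rows.
sample = [
    {"Sequence": "s1", "Function": "kinase"},
    {"Sequence": "s2", "Function": "protease"},
    {"Sequence": "s3", "Function": "kinase"},
]
categories = export_categories(sample, "Function")
clusters = export_clusters(sample, "Function")
print(list(categories))                   # one entry per output file
print([r["Sequence"] for r in clusters])  # row order in the single file
```

In a real export each value of `categories` would be written to its own CSV file in the destination folder, while `clusters` would be written to one file.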

5.2. Search and replace

This function works in the same way as the search and replace function described earlier in this manual, but applies the search and replacement of labels to the Worksheet. Clicking on this option opens a dialog in which you choose between two utilities: "Search" or "Replace".

Sorting/Filtering

Worksheet contents can be organized according to the ascending or descending order established in a column. Select the column of reference, choose the type of data for this column (text or numerical data), and decide the ordering criterion. The whole worksheet will be rearranged according to your choice.

Import

This function allows users to join two or more omic projects through three different options ("Append worksheet", "Combine worksheets" and "Clusters"), as described below.

APPEND WORKSHEET: This tool exports the rows of one worksheet to another worksheet used as the master file (Figure 35). For this to succeed, the rows of both worksheets must be associated with columns with identical names.

COMBINE WORKSHEETS: This function combines data from two worksheets into a single worksheet using a common column as a join reference (for instance, a sequence name or identifier). The difference between this tool and the previous utility "Append worksheet" is that while the latter is designed to add new rows to a master worksheet (i.e. new sequences to annotate or to classify in your database project), "Combine worksheets" adds new columns with new information (taxonomy, ontology, etc.) to a worksheet containing a pool of sequences already analyzed under other, complementary topics. The utility is useful, for example, for joining the results of two comparative analyses performed via two independent BLAST searches between a query database file and two different Refseq databases.
Figure 36 presents the steps required to run this utility.

CLUSTERS: If you previously identified homology relationships (for example, paralogs, repeats or MGEs related to one another) among the different sequences of your project, this utility allows you to add this information to the worksheet. As shown in Figure 37, create a CSV file containing as many columns as names of related sequences (the members of an established cluster) and as many rows as clusters of homologs. Open the worksheet where you are managing the annotation and use the "Clusters" utility to append the information. GPRO will add a new column detailing the distinct homologs identified per sequence.

Figure 35. Append worksheet. This utility exports the rows of a worksheet into another worksheet used as a master file, where the rows of the imported worksheet will be saved. For this to succeed, the rows of both worksheets must be associated with columns with identical names. Open the CSV you want to use as the master worksheet (framed by a red outline in the figure). Go to Import > Append worksheet. GPRO will open a window in which you can browse for the worksheet from which you import the data using the Import tab of the dialog. The grid of this dialog shows the names of the columns available in each worksheet. If the names are accompanied by a green icon in the status section, the column names are identical and you can proceed to join both worksheets. If any status icon appears in red, please revise the column names of the two worksheets to make them match before running the utility.
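A minimal Python sketch of the append operation, using only the standard `csv` module, is given here for orientation. The function name `append_worksheet` and the sample data are hypothetical. The sketch assumes that "identical names" means the same set of column names, and it writes the imported rows in the master's column order; whether GPRO also requires the same column order is not stated in the manual.

```python
import csv
import io


def append_worksheet(master_csv, imported_csv):
    """Append the rows of imported_csv to master_csv (both CSV strings)."""
    master = csv.DictReader(io.StringIO(master_csv))
    imported = csv.DictReader(io.StringIO(imported_csv))
    # Equivalent of the green/red status icons: names must match.
    if set(master.fieldnames) != set(imported.fieldnames):
        mismatched = set(master.fieldnames) ^ set(imported.fieldnames)
        raise ValueError(f"column names differ: {sorted(mismatched)}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=master.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(master)    # existing master rows first
    writer.writerows(imported)  # then the appended rows
    return out.getvalue()


appended = append_worksheet("Sequence,Score\ns1,50\n",
                            "Score,Sequence\n70,s2\n")
print(appended)
```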

Figure 36. Combine worksheets. This function allows you to export columns from one worksheet to another using a column that is common to both worksheets as a join reference (identical row labels or terms, for instance sequence names). To combine worksheets, select the option "Combine worksheets" within the Import command of the worksheet menu to open the dialog window of this function. (1) Browse for the CSV files you want to use as master and related worksheets, respectively, using their corresponding browsers (the new columns will be appended to the master worksheet). (2) The master and the related worksheet sections (to the left and to the right, respectively) each present a drop-down dialog called "Key Column" for you to select the column common to both worksheets (remember that this column must contain identical labels). (3) Below this is a list presenting the distinct columns of each file. Check the columns you want to combine for each worksheet, then (4) browse for an output CSV in which to save the new project and click OK to run the function.
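Conceptually, combining worksheets is a join on the key column. The Python sketch below illustrates the idea under stated assumptions: the function name `combine_worksheets` and the sample rows are hypothetical, every master row is kept, and master rows without a match in the related worksheet receive empty cells (the manual does not say how GPRO treats unmatched rows).

```python
def combine_worksheets(master_rows, related_rows, key, related_columns):
    """Add selected columns of related_rows to master_rows, matched on key."""
    lookup = {row[key]: row for row in related_rows}
    combined = []
    for row in master_rows:
        match = lookup.get(row[key], {})
        new_row = dict(row)
        for column in related_columns:
            # Assumption: unmatched rows get an empty cell.
            new_row[column] = match.get(column, "")
        combined.append(new_row)
    return combined


# Invented sample worksheets sharing the key column "Sequence".
master = [{"Sequence": "s1", "Function": "kinase"},
          {"Sequence": "s2", "Function": "protease"}]
related = [{"Sequence": "s2", "Species": "E. coli"},
           {"Sequence": "s1", "Species": "S. aureus"}]
combined = combine_worksheets(master, related, "Sequence", ["Species"])
for row in combined:
    print(row)
```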

Figure 37. Import clusters. This utility automatically adds a new column providing known information about common relationships (function, taxonomy, paralogs, repeats, MGEs, etc.) among rows in the worksheet. For this task you need (1) a previously created CSV file containing the information distributed in as many columns as names of related sequences (the members of a cluster) and as many rows as clusters (framed in blue in the example). Once you have this file, (2) open the CSV you want to use as the master worksheet for adding a new column. Click on the "Import" worksheet-menu command and select the "Clusters" option. (3) A new dialog window will open for you to browse for the cluster CSV file to be loaded (point 4) and to choose the column to be taken as a reference for adding cluster information. (5) Click OK and GPRO will add a new column called "Cluster group" (framed in red in the example) with the names of all cluster counterparts for each sequence.
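The following Python sketch shows one way to derive such a "Cluster group" column from a cluster table with one cluster per row. It is illustrative only: the function name `add_cluster_group` is hypothetical, the sequence names are invented, and the semicolon used to separate counterpart names inside the new cell is an assumption, since the manual does not describe the cell format.

```python
def add_cluster_group(rows, clusters, key="Sequence"):
    """Return copies of rows with a 'Cluster group' column added.

    clusters is a list of lists, one list of member names per cluster
    (empty cells, as found in ragged CSV rows, are ignored).
    """
    counterparts = {}
    for members in clusters:
        members = [m for m in members if m]
        for member in members:
            counterparts[member] = [m for m in members if m != member]
    result = []
    for row in rows:
        new_row = dict(row)
        # Assumption: counterpart names are joined with ';'.
        new_row["Cluster group"] = ";".join(counterparts.get(row[key], []))
        result.append(new_row)
    return result


# Invented cluster table: two clusters, the second padded with an empty cell.
cluster_table = [["s1", "s2", "s3"], ["s4", "s5", ""]]
worksheet = [{"Sequence": "s1"}, {"Sequence": "s4"}, {"Sequence": "s6"}]
with_clusters = add_cluster_group(worksheet, cluster_table)
for row in with_clusters:
    print(row)
```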

5.5. Export options

This utility allows the user to select which columns to show or hide in the worksheet. You can select the columns you want to show (or export) by clicking with the mouse on the checkbox at the top of each column.

5.6. Annotation

This command provides diverse options for launching the pipeline in order to annotate your database project by appending functional and ontology levels to your previously mapped sequences using the specialized annotation vocabularies (GO and the COG/KOG system) described below. It also allows you to color the rows of the worksheet differentially according to distinct levels of significance (non-significant hits, mapped sequences, annotated sequences, etc.), and to switch the IDs of your sequences between distinct classification systems.

GENE ONTOLOGY (GO): By clicking on the tab "Append GO terms", you can add new columns to the worksheet containing the GO terms, GO IDs, Enzyme Codes (ECs) and InterProScan IDs (Figure 38). The GO terminology [4,77] is a bioinformatic initiative created with the aim of providing a controlled vocabulary (ontology) of terms for describing and annotating gene product data. GO is a component of the Open Biological and Biomedical Ontologies (OBO), established to create vocabularies for shared use across different biological and medical domains [78]. GO covers three domains: cellular component (C), which corresponds to the parts of a cell or its extracellular environment; molecular function (F), which collects the elemental activities of a gene product at the molecular level, such as binding or catalysis; and biological process (P), which describes operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs and organisms.

The EC is a number assigned to a type of enzyme according to a scheme of standardized enzyme nomenclature found in ENZYME, the enzyme nomenclature database [79], maintained at the ExPASy molecular biology server [80]. InterProScan scans sequences against InterPro, an integrated database of predictive protein signatures [7] used for the classification and automatic annotation of proteins and genomes, available at EBI [81]. More information about the GO initiative is available at its official Web site; for details about how to install GO databases in GPRO, visit the official Web site of the tool.

Figure 38. GO annotation. To add new columns containing GO terms to the worksheet: (1) open the dialog window of the function "Append GO terms" available in the worksheet command called "Annotation". (2) Use the drop-down selectors called "Column name" and "Data type", respectively, to select the worksheet column containing the IDs of the mapped sequences and the type of IDs, which must be GIs or Uniprot IDs. If you mapped your sequences using GenBank accessions, use the function "Switch GI/accession", also available in the command "Annotation", to convert GenBank accessions to GIs (see section 5.6.4). (3) Use the mouse to select the annotation columns you want to append to the worksheet based on the GO system and its related nomenclatures ("GO", "EC", "InterProScan") and (4) click OK. If you select the three features, GPRO will add three new columns to the worksheet (framed in red), providing annotation information for each sequence (row).

COG/KOGS: The Clusters of Orthologous Groups (COGs) of prokaryotic proteins and their eukaryotic counterparts (KOGs) [3,82] collect prokaryotic or eukaryotic proteins into groups containing orthologs of different species (or paralogs derived from duplication of a single gene within a genome). To append COG/KOG terms to your sequences, you must first have mapped them to the Refseq COG and KOG databases integrated in the NCBI Conserved Domain Database (CDD) [83], available at the FTP of the NCBI (where the COG and KOG RefSeq databases are called "myva" and "kyva", respectively). This is a simplified explanation because, in addition to the "myva" and "kyva" databases, the COG/KOG system includes more databases and files that you need to install in the GPRO pipeline to annotate sequences on the basis of COG/KOG terms. Please visit the official Web site of the tool for more details about how to download and install COG/KOG databases in GPRO. See Section 4 of this manual for how to format the "myva" and "kyva" databases for BLAST mapping.

If you have successfully installed all COG/KOG files in the GPRO pipeline, you can use it to add COG or KOG terms to your worksheet by clicking on the tabs "Append COG terms" (if you are annotating prokaryotic orthologs, as shown in Figure 39) or "Append KOG terms" (if you are dealing with eukaryotic orthologs). GPRO will automatically append two or three new columns ("COG name", "COG function" and "GI") to the worksheet as shown in the figure below. To add KOG terms, the process is the same as that described in Figure 39.

On users' recommendations we can implement other biological vocabularies. If you would like libraries not yet covered by GPRO to be included, please contact our technical support Web site to make this request (see Section 4.8 "Help").

Figure 39. Appending COG or KOG terms to the worksheet. If you have previously used the BLAST search compiled in the pipeline to map the sequences of your omic project using the myva database (prokaryotic orthologs) or the kyva database (eukaryotic orthologs), you can (1) select the utility "Append COG terms" in the Annotation command of the worksheet menu in order to add COG annotations to the worksheet. (2) Choose the column of reference ("GI") and the type of data contained in this column ("gi" or "Protein names" of the mapped myva or kyva subjects provided by the BLAST output). (3) Check the boxes below this dialog to choose the COG/KOG terms you want to append to the worksheet; two or three new columns will be added (framed in red in the example), depending on the checked terms ("COG", "COG function", "GO GI"). (4) Click OK to run the COG/KOG annotation. If you have not run the BLAST search against the myva (for COGs) or the kyva (for KOGs) databases, please see section 5.6 to retrieve the COG/KOG databases from NCBI and section 4.4 to format these databases in the GPRO pipeline.

5.6.3. APPLY ANNOTATION COLORS: This utility is for configuring specific preferences for your worksheet. It allows you to set the colors to be applied to the worksheet's rows according to the following annotation criteria:

Non-significant hits: rows containing null E-values or E-values higher than the threshold value specified by the user (in white).
Significant hits: rows containing E-values lower than the threshold value specified by the user (in orange).
Mapped: rows containing significant hits and GO codes (in green).
Annotated: mapped rows that also contain Enzyme Codes (in steel blue).
Annotated plus: annotated rows that contain other annotation criteria (in dark goldenrod).

Figure 40. Applying annotation colors. (1) Select "Apply annotation colors" in the Annotation tab and a dialog window opens (remember, you can change the E-value threshold for significant mapping). (2) The resulting worksheet will display the rows colored according to the annotation criteria.
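The five criteria above form a cascade, which the following Python sketch makes explicit. It is a hedged illustration rather than GPRO's logic: the function name `annotation_color`, the column names "E-value", "GO", "EC" and "InterProScan", and the default threshold are all assumptions. In particular, the manual does not say which columns count as "other annotation criteria" (InterProScan is used here as a stand-in), nor how an E-value exactly equal to the threshold is treated (here it counts as significant).

```python
def annotation_color(row, threshold=1e-5):
    """Return the row color implied by the annotation criteria."""
    evalue = row.get("E-value")
    if evalue in (None, "") or float(evalue) > threshold:
        return "white"           # non-significant hit
    if not row.get("GO"):
        return "orange"          # significant hit only
    if not row.get("EC"):
        return "green"           # mapped: significant hit plus GO code
    if not row.get("InterProScan"):
        return "steel blue"      # annotated: mapped plus Enzyme Code
    return "dark goldenrod"      # annotated plus: further annotation


# Invented rows covering each level.
examples = [
    {"E-value": ""},
    {"E-value": "1e-3"},
    {"E-value": "1e-20"},
    {"E-value": "1e-20", "GO": "GO:0005524"},
    {"E-value": "1e-20", "GO": "GO:0005524", "EC": "2.7.11.1"},
    {"E-value": "1e-20", "GO": "GO:0005524", "EC": "2.7.11.1",
     "InterProScan": "IPR000719"},
]
colors = [annotation_color(row) for row in examples]
print(colors)
```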

Remove the assigned row colors in the worksheet by clicking on the "Apply annotation colors" option of the Annotation tab.

5.6.4. SWITCH GI/ACCESSION: With this tool the user can use the GPRO pipeline to switch from an Accession Number to its corresponding Gene Identifier (GI) or vice versa. The GI is an identification number for nucleotide and protein sequences, while the accession number represents the database record of a sequence in GenBank [5]. GenBank is a database where nucleotide and protein sequences from more than 260,000 organisms are publicly available thanks to an international collaboration among the NCBI, the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL-Bank, [84]), and the DNA DataBank of Japan (DDBJ, [85]). As demonstrated in Figure 41, to switch between GIs and accessions, invoke the utility with the same name in the command "Annotation", available in the worksheet menu. A dialog will appear where you must choose the name of the worksheet column containing this type of information. Go to the option "Data type" in this dialog to select among the following options: protein GI, nucleotide GI, GenBank protein accession or GenBank nucleotide accession. Click OK and the entries within the selected worksheet column will be replaced, provided they are valid. GIs and GenBank accessions are associated with each other, but if you select the wrong data type the tool will fail to switch this information.

Figure 41. Switching between GIs and accessions. (1) Press "Switch GI/Accession" in the Annotation tab and a dialog window opens. You can change the GI terms to accession numbers or vice versa. Select the column with the GI or accession terms you want to process. Specify the type of data the column contains (in the example, GIs) and (2) click OK. The GIs (framed in blue in the example) listed in the column will be replaced by their corresponding accession numbers (framed in red).

Selecting rows

This command lets users select specific rows of the worksheet (for export or annotation purposes) by using different selection criteria (terms, colors, etc.) as described in the next sections. Any of these selections can be used to export rows using the function "Export" available in the File command of the worksheet.

BY KEY TERMS: This utility lets you select rows (i.e. sequences) by performing a selection in any column using specific terms, as described in Figure 42.

Figure 42. Selecting rows by key terms. (1) Click on the By key terms utility available in the Select command, and a new dialog window will appear. (2) Press the "Add" tab and enter as many terms as you want to select, optionally assigning a specific color to each. (3) Select the column in which the selection is to be performed and run the script. A summary dialog will indicate the number of selected rows. (4) As a result, the selected terms will be highlighted in the worksheet with the assigned colors.

BY EXPECT OR STATISTICS VALUES: This selection can only be performed on columns containing numerical data. Given a chosen cutoff value, you can make the selection using the statistical significance of your values as the criterion (for instance, a column containing e-values, as shown in Figure 43).
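The cutoff-based selection described above can be pictured with a short Python sketch. This is not GPRO's own code: the function name classify_by_cutoff, the sample rows and the three-way split (values at or below the cutoff, values above it, and cells that are not numeric) are assumptions made purely for illustration.

```python
def classify_by_cutoff(rows, column, cutoff):
    """Group row indices by how the value in `column` compares to `cutoff`.

    Hypothetical helper: returns three lists of row indices
    (at or below the cutoff, above the cutoff, not numeric).
    """
    at_or_below, above, non_numeric = [], [], []
    for index, row in enumerate(rows):
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            # Missing column or text such as "no hit": cannot be compared.
            non_numeric.append(index)
            continue
        if value <= cutoff:
            at_or_below.append(index)
        else:
            above.append(index)
    return at_or_below, above, non_numeric


# Invented worksheet rows, used only to exercise the function.
rows = [
    {"Sequence": "seq1", "E-value": "1e-50"},
    {"Sequence": "seq2", "E-value": "0.5"},
    {"Sequence": "seq3", "E-value": "no hit"},
]
low, high, other = classify_by_cutoff(rows, "E-value", 1e-5)
```

Each of the three groups would then correspond to one of the colors chosen in the dialog.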


Figure 43. Selecting rows by expect or statistical values. (1) The dialog window that opens when you click on this option. Enter a cutoff value and choose a numerical reference column. The utility will color rows with lower values, rows with higher values and rows with non-significant hits differently, according to the colors selected for each of these cases. You can modify the color code by clicking on each color box, and you can tell GPRO to check, in the worksheet, the rows highlighted in any color (as shown in the figure). Then click OK to run the script. (2) A summary dialog will indicate the number of rows selected in each category. (3) An example of a worksheet colored according to the selected criteria.

BY COLOR: This tab provides an additional utility for selecting previously colored rows (Figure 44).

Figure 44. Selecting annotated rows differentiated by colors. (1) The dialog window that opens when you click on this option. Choose one of the colors (orange in the example) corresponding to rows previously colored according to your criteria. Note that you can also check the rows with the selected color using the mouse. Press OK to run the script. (2) A summary dialog will indicate the number of rows selected according to the color. (3) An example of a worksheet in which only the orange rows are checked.

REMOVE SELECTED SEQUENCES: You can use this tab to refine your database or omic project by removing previously selected rows from the worksheet.

Associate database

This command allows users to associate a specific FASTA database with a worksheet, using one or more worksheet columns as the reference. The purpose of this utility is to update the information in the associated database automatically whenever the text in the worksheet is edited. The procedure is explained below.

ASSOCIATE DATABASE: Using the Associate database option you can select a FASTA database file and associate it with your worksheet according to reference columns (Figure 45). For the association to succeed, the contents of the selected columns must be found in the FASTA headers of the sequences within the database. Once the association has been made, GPRO displays information about the associated file in the footer of the worksheet (see Figure 45). In the FASTA Explorer list of an associated FASTA file, sequence names are marked with green squares if they are identical to those in the worksheet and with red squares if they do not match.
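The matching requirement described above (the contents of the reference columns must appear in a FASTA header) can be sketched in Python as follows. This is only an illustration under stated assumptions, not GPRO's implementation: the function names, the substring test and the sample data are all hypothetical.

```python
def read_fasta_headers(fasta_text):
    """Return the header lines of a FASTA text, without the leading '>'."""
    return [line[1:].strip()
            for line in fasta_text.splitlines()
            if line.startswith(">")]


def match_rows_to_headers(rows, columns, headers, separator=" "):
    """Map each row index to the first header containing its reference key.

    The key is built by joining the values of the reference columns with
    `separator` (assumed to play the role of the column separator char).
    Rows without a matching header map to None.
    """
    matches = {}
    for index, row in enumerate(rows):
        key = separator.join(row[column] for column in columns)
        matches[index] = next((h for h in headers if key in h), None)
    return matches


# Invented example data.
fasta_text = ">seq1 kinase\nATGC\n>seq2 protease\nGGCC\n"
rows = [
    {"Sequence": "seq1", "Function": "kinase"},
    {"Sequence": "seq3", "Function": "unknown"},
]
headers = read_fasta_headers(fasta_text)
matches = match_rows_to_headers(rows, ["Sequence", "Function"], headers)
```

In this sketch the first row is associated with the header "seq1 kinase", while the second row finds no counterpart in the file.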

Figure 45. Associating a database file with a worksheet. (1) Clicking on the Associate database function opens a new window. (2) Using the worksheet columns option, select the reference columns for the association in the left area (in the example, Sequence and Function) and transfer them to the right area (sequence columns). If required, choose the column separator character. (3) Browse your directory for the FASTA file to associate with the active worksheet. The script will present a preview of the titles of the selected worksheet sequence columns and of the FASTA header. Press OK to run the script. (4) The association summary dialog will indicate the number of worksheet rows and FASTA sequences that have been associated. If the association was successful (check the footer of both files), any change made in the worksheet column will be automatically applied to the FASTA header of the sequence associated with that worksheet row in the FASTA file.

REMOVE ASSOCIATION: Associations between the worksheet and the database file are maintained even if GPRO is closed. This option allows you to remove an association.


6. MOUSE FUNCTIONS AND TRICKS

This section describes the GPRO actions that are accessible by clicking the right mouse button. Doing so opens a small contextual menu that lets users execute additional functions. This menu varies depending on the GPRO section (Directory, FTP, Worksheet or Editor) over which the mouse is positioned.

Directory and FTP

Right-clicking in the Directory or FTP (or on any folder or file therein) opens the corresponding menu, each providing utilities for managing the contents of that section (Figure 46).

Figure 46. Directory and FTP contextual menus. The Directory menu has functions to create a new file or folder, open a database as a worksheet, open worksheet and database files, open a file with the TIME editor, and other basic tools (cut, copy, paste, delete and rename). Finally, Project properties allows the user to consult the data contained in a selected folder. In the FTP menu, the functions permit the user to create a new folder, download previously selected items, and delete, rename and compress files and folders. In both cases, the mouse image indicates that the menu appears when the user clicks the right button.

Database Editor

Right-clicking in the Database editor opens a contextual menu that provides functions for editing contents (cut, copy and paste). In addition, when you click on any sequence name listed in the FASTA Explorer, the database editor scrolls through the edited file to the selected sequence, which is highlighted in color (Figure 47).

Figure 47. Database Editor contextual menu. In this example, the sequence highlighted in blue in the FASTA file appears when a sequence name is selected in the FASTA Explorer. The contextual menu shown in the central area provides the functions to save a file and to cut, copy and paste, while that of the FASTA Explorer allows the selected sequence to be opened in the TIME editor and selected sequences to be deleted or renamed. The mouse image indicates that the menu appears when the user clicks the right button, while the hand shows the menu displayed when clicking on a sequence name in the list.

Worksheet

Using the mouse, the user can perform a variety of actions in the worksheet in addition to those available in the worksheet main menu. These include: (a) checking and manually selecting rows or columns; (b) switching columns by dragging them from one position to another; (c) opening the contextual menu, which provides additional actions (add, select, remove, etc.) for managing rows and columns (as described in Figure 48).


Figure 48. Worksheet contextual menu. Pressing the right button (mouse image in the figure) opens a contextual menu that offers two submenus, one for managing rows and one for managing columns. In the row menu, the user can check and uncheck selected rows; copy, cut, paste and delete checked rows; and add new rows. Similarly, the column menu allows users to select and unselect columns; add new ones; delete, rename, join and split selected columns; and fit the content of columns. Each column can be moved by dragging it from one position to another (as the Hit from arrow demonstrates in this example). The hand image indicates that the user can check rows and columns by clicking on the corresponding checkboxes (as the arrows indicate).

7. ACKNOWLEDGEMENTS

GPRO 1.0 has been supported in part by grant IDI from CDTI (Centro de Desarrollo Tecnológico Industrial) and PTQ and PTQ from MICINN (Ministerio de Ciencia e Innovación) in Spain.

8. CITING GPRO

If you publish findings obtained using GPRO and wish to cite the tool, it is suggested that you use the following publication:

R. Futami, A. Muñoz-Pomer, JM. Viu, L. Dominguez-Escribá, L. Covelli, GP. Bernet, JM. Sempere, A. Moya, C. Llorens. GPRO: the professional tool for annotation, management and functional analysis of omic databases. (2011) Biotechvana Bioinformatics: 2011-SOFT3

9. REFERENCE LIST
