BIOEXTRACT SERVER TUTORIAL

Title: Creating Bioinformatic Workflows within the BioExtract Server Leveraging iPlant Resources

Carol Lushbough
Assistant Professor of Computer Science
University of South Dakota

Rion Dooley
Manager, Web & Cloud Services Group
Research Associate, Texas Advanced Computing Center
University of Texas

bioextract.org

Contents
Introduction
Quick Start
Overview of the iPlant Discovery Environment
Uploading a file to the iPlant Discovery Environment
Overview of the iPlant Foundation API
Exploring the iPlant Foundation API
BioExtract Server Overview
Registering with the BioExtract Server
Registering with the BioExtract Server using your iPlant id
BioExtract Server Data Query
Creating a query: find all protein records from UniProtKB related to an myb gene in Panicoideae
BioExtract Server Data Extracts
Creating a query with Boolean expression OR: find all NCBI Nucleotide records associated with the CDH4 or C4 gene
Filtering data extracts
Saving data extracts
Analyzing Data within the BioExtract Server
Executing an analytic tool
Viewing analytic tool results
Executing iPlant analytic tool within the BioExtract Server
Using an iPlant Discovery Environment (DE) File as Input
BioExtract Server Workflows
Preparing to record a workflow
Creating a workflow
Saving and executing a workflow
Viewing workflow provenance information
Modifying a workflow
References

Introduction

This tutorial is designed to acquaint researchers with the BioExtract Server [2] and the iPlant Collaborative [3] Foundation API. It is intended for anyone who wants to learn, improve, or update their knowledge of bioinformatics workflows and of leveraging iPlant resources through the Foundation API. The target audience includes scientists from different fields, such as biologists and software developers, at various levels (researchers, educators, graduate students, and other scientific staff) who either work with global biological data or are interested in understanding how to incorporate such data into their specific research workflows.

Quick Start

A. Register with the BioExtract Server
1. Click the register link in the upper right-hand corner of the BioExtract Server screen. You will be presented with the "Create Account" interface.
2. Click the Register iPlant Account tab on the "Create Account" screen. (If you have not yet done so, go to the iPlant Collaborative Discovery Environment (http://www.iplantcollaborative.org/discover/discovery-environment) and register.)
3. Enter your iPlant id, password, and email address.
4. Click the Register Account button.

B. Log in to the BioExtract Server
1. Click the sign in link.
2. Enter your iPlant user name and password.

C. Execute a Query
1. Select the Query tab.
2. Select Protein Sequences and check the box next to NCBI protein database.
3. Select gene as the Search field and enter FXN as the search term.
4. Click Add Search Line, select Species as the Search field, and enter Human as the search term.
5. Click Add Search Line again, select AND NOT, select Definition as the Search field, and enter Full=Frataxin as the search term.
6. Click the Submit Query button.

D. Save data set
1. After the query has been executed, the Extracts tab should be active.

2. Click the Save Extract button.
3. Enter an Extract Name and Description.
4. The extract becomes a searchable data extract and is listed in the Available Data Sources tree on the Query tab under the Miscellaneous node.

E. Execute TCoffee
1. On the Tools tab, select the TCoffee analytic tool under the Alignment Tools node in the Tools tree on the left.
2. Click the radio button adjacent to "Use records on Extract page formatted as Fasta" to use those records as input into the tool.

3. Click the Execute button.

F. View the output
1. Select the desired output file from the Tool Results drop-down list.
2. Click the View Results button.

G. Save results to iPlant
1. Select the desired output file from the Tool Results drop-down list.
2. Click the Save Results button.
3. Navigate to the desired iPlant Data Store folder.

4. Click the Create File button.
5. Enter the desired file name.
6. Click OK.

H. Execute Muscle
1. Select the clustalo-lonestar tool under the iPlant node in the Tools tree on the left.
2. Specify that you would like to "Use previously executed tool results" for input into Muscle.
3. Select TCoffee from the Select Tool drop-down list, then select the sequence.fasta file.
4. Set Force sequence input file format to fasta.
5. Set Force sequence type to Protein.

6. Click the Execute button.

I. Save workflow
1. Click the Workflow tab.
2. Click the Create and Import Workflows node at the top of the Workflows tree.
3. Type in a Name and Description for your workflow.
4. Click the Save button.
5. Click your workflow name in the workflow tree.
6. Click the Start button at the top of the workflow graphics panel. (Node colors indicate status: green means the process has completed, blue means it is executing, and yellow means it is waiting.)

7. Once a process has completed, you can click its node to view the results.
8. Once the workflow has completed, you can click the Provenance button at the bottom of the panel to view the workflow provenance information.

J. Modifying a Workflow
1. Expand your workflow node.
2. Expand the Query process, modify the query to search for the wcag gene in Salmonella Typhimurium, and click the Save button:
   common:gene=wcag AND common:species=salmonella AND common:defn='typhimurium'
3. Click the name of the workflow and rerun it by clicking Start.
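
The query expression in step 2 is a list of field=value terms joined by Boolean operators. As a minimal sketch of how such a string could be assembled before pasting it into the Query process, the helper below joins (operator, field, value) triples. The function build_query is hypothetical and not part of the BioExtract Server; only the field names and operators are taken from this tutorial.

```python
def build_query(terms):
    """Join (operator, field, value) triples into one query expression.

    The operator of the first triple is ignored because nothing precedes it.
    """
    allowed = {"AND", "OR", "AND NOT"}  # operators named in this tutorial
    parts = []
    for index, (operator, field, value) in enumerate(terms):
        if index > 0:
            if operator not in allowed:
                raise ValueError(f"unsupported operator: {operator!r}")
            parts.append(operator)
        parts.append(f"{field}={value}")
    return " ".join(parts)


# Hypothetical helper applied to the terms from step 2.
query = build_query([
    (None, "common:gene", "wcag"),
    ("AND", "common:species", "salmonella"),
    ("AND", "common:defn", "'typhimurium'"),
])
print(query)
```

Swapping "wcag" for another gene name yields a new expression for the same workflow.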

Overview of the iPlant Discovery Environment

The iPlant Discovery Environment (DE) [4] is one of the ways researchers can leverage iPlant Collaborative resources. Instead of managing computing resource details or learning new software for every type of analysis, you can use the DE to manage, analyze, and share large datasets.

Uploading a file to the iPlant Discovery Environment
1. Log in to the iPlant DE at http://www.iplantcollaborative.org/discover/discovery-environment.
2. Click the Data icon in the upper left portion of the screen.
3. Click the icon on the Discovery Environment page. Upload a file to the DE by expanding the data folder and clicking the import button in the upper left corner of the data screen. Select Simple Upload from Desktop.

4. You will be prompted to select files from your desktop. Select c:\working\export-1.txt. After selecting the file, click Upload.
5. To view the file after it has been uploaded, click its name.

Overview of the iPlant Foundation API

The iPlant Collaborative offers a low-level HTTP and command-line API that provides fine-grained access to the storage, authentication, and data manipulation infrastructure maintained by iPlant. The iPlant IO service API enables the asynchronous movement of file data into and out of the iPlant DE.

Exploring the iPlant Foundation API
1. Navigate to http://iplant-dev.tacc.utexas.edu/v1/foundation-backbone/ in your browser. Click Log In.
2. Enter your iPlant username and password. (Note: if you are not a registered iPlant user, go to http://www.iplantcollaborative.org/ and click the Login or Register link at the top of the screen.) Click Get Token.
3. Click Validate.

4. After you have logged in, click the I/O option and select Browse Files. The list of files you have stored in your DE will appear on the screen.
   API call: GET /io-v1/io/list/<<username>>/
5. By clicking the Apps option, you can retrieve a list of all public tools and the tools that are shared with you.
   API call: GET /apps-v1/apps/list
6. To execute a tool, click its name, provide the input and parameter information, and click Submit.
   API call: GET /apps-v1/apps/form/clustalw2-lonestar-2.1u2
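
The API calls listed in the steps above can also be prepared from a script. The following is a minimal sketch that only builds the request objects and does not send them. Two things in it are assumptions rather than facts from this tutorial: that the API is served from the same host as the explorer page, and that the token obtained with Get Token is presented through HTTP Basic authentication. The user name and token values are placeholders.

```python
import base64
from urllib.parse import quote, urljoin
from urllib.request import Request

# Assumption: the API lives on the same host as the explorer page.
BASE_URL = "http://iplant-dev.tacc.utexas.edu/"


def foundation_request(path, username, token):
    """Build (but do not send) a GET request for a Foundation API path."""
    url = urljoin(BASE_URL, path.lstrip("/"))
    # Assumption: HTTP Basic authentication with the user name and token.
    credentials = base64.b64encode(f"{username}:{token}".encode()).decode()
    headers = {"Authorization": f"Basic {credentials}"}
    return Request(url, headers=headers, method="GET")


def list_files_path(username):
    """Path of the file-listing call, with the user name URL-encoded."""
    return f"/io-v1/io/list/{quote(username, safe='')}/"


# Placeholder credentials, for illustration only.
request = foundation_request(list_files_path("jdoe"), "jdoe", "example-token")
apps_request = foundation_request("/apps-v1/apps/list", "jdoe", "example-token")
print(request.get_method(), request.full_url)
print(apps_request.get_method(), apps_request.full_url)
```

Passing either object to urllib.request.urlopen would send the call; that step is left out here because the response format is not described in this tutorial.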

BioExtract Server Overview

The BioExtract Server (bioextract.org), funded by the United States National Science Foundation, is a Web-based, workflow-enabling system designed to aid researchers in the analysis of genomic data by providing a platform for the creation of bioinformatics workflows. The BioExtract Server provides:
1) a flexible querying and retrieval interface to National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EBI) non-redundant nucleotide and protein databases,
2) the ability to filter query results and use them as input into analytic tools,
3) the facility to save query results as data extracts, which are automatically integrated into the system as searchable data sets,
4) access to analytic tools, including a large list of curated Web services such as EMBOSS (http://www.ebi.ac.uk/tools/emboss) [5] and BioMart (http://www.biomart.org/) [6] resources,
5) the ability to save a series of BioExtract Server tasks (e.g. query a data source, save a data extract and execute an analytic tool) as a workflow, and
6) the opportunity for researchers to share their data extracts, analytic tools and workflows with collaborators.

The BioExtract Server functionality includes the ability to:
- query multiple data sources
- export search results
- save search results as searchable data extracts
- execute distributed analytic tools
- create, execute, modify, report, and share workflows

Registering with the BioExtract Server

New BioExtract Server users can use the BioExtract Server as a guest or register for an account. As a guest, researchers can browse, search for data, access the BioExtract Server's public tools, and execute the public workflows. Registered users have many more options available: they can save query results, add tools to their account, and create, modify, and share workflows with others. Functionality has been added to the BioExtract Server to allow users to register with their iPlant Collaborative id, giving them access to their iPlant resources within the BioExtract Server.

Registering with the BioExtract Server using your iPlant id:
1. Click the register link in the upper right-hand corner of the BioExtract Server screen. You will be presented with the "Create Account" interface.
2. Click the Register iPlant Account tab on the "Create Account" screen. (If you have not yet done so, go to the iPlant Collaborative Discovery Environment (http://www.iplantcollaborative.org/discover/discovery-environment) and register.)
3. Enter your iPlant id, password, and email address.
4. Click the Register Account button.

BioExtract Server Data Query

Data sources are collections of nucleotide and protein sequence data you can search through. The BioExtract Server provides access to the following data sources:
- European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database (also known as EMBL-Bank)
- GenBank nucleotide and protein sequence databases at the National Center for Biotechnology Information (NCBI) [7]
- European Bioinformatics Institute's (EBI) Universal Protein Resource (UniProt) and UniProt Reference Clusters (UniRef) databases [8]

Data sources can be found in the Query tab within the Available Data Sources box. They are arranged into different groups. The Nucleotide Sequences and Protein Sequences groups contain DNA and protein sequence data. The Miscellaneous, Viridiplantae and Viridiplantae Protein groups contain plant-specific proteins and nucleotides. The All group contains a list of all predefined data sources as well as any data extracts created by or shared with the logged-in user.

Nucleotide Sequences
- EMBL-Bank
  - EMBL-Bank: The full quarterly release of all EMBL-Bank entries except MGA and EMBL-CDS entries.
  - EMBL-Bank (Update since last release): All EMBL-Bank entries created or updated after the latest EMBL-Bank release except CON, MGA or EMBL-CDS entries.
  - EMBL-Bank (Deleted Entries): Entries no longer present in the latest EMBL-Bank release.
  - EMBL-Bank (Coding Sequence): Full release of all EMBL-CDS entries.
- NCBI Nucleotide Databases (http://www.ncbi.nlm.nih.gov/guide/all/#databases_)
  - Nucleotide (nuccore): Contains all nucleotide sequences not in EST or GSS.
  - EST (Expressed Sequence Tags): Contains short single-pass reads of cDNA (transcript) sequences.
  - GSS (Genome Survey Sequences): Contains short single-pass reads of genomic DNA.
  - Nucleotide: Contains the sequence data in GenBank, EMBL and DDBJ, including all of nuccore, EST and GSS.

Protein Sequences
- NCBI Protein Database: Contains sequence data from the translated coding regions of DNA sequences in EMBL/GenBank/DDBJ, as well as protein sequences submitted to PIR, SwissProt, PRF, and PDB.
- UniRef: The UniProt NREF (UniProt Reference Clusters) database. In the UniRef90 and UniRef50 databases, no pair of sequences in the representative set has >90% or >50% mutual sequence identity, respectively. The UniRef100 database presents identical sequences and sub-fragments as a single entry.
- UniProtKB: The UniProt Knowledgebase (UniProtKB) is a complete annotated protein sequence database. More information: http://www.uniprot.org/help/uniprotkb

Plant-specific nucleotides and proteins
- Miscellaneous
  - GB-PLN (DNA): GenBank plant nucleotide sequence data comprising the entire PLN division from NCBI.
  - GB-PLN (protein): GenBank plant protein sequence data comprising the entire PLN division from NCBI.
- Viridiplantae and Viridiplantae Protein: GB-PLN DNA and protein sequences for green plants. Data are updated monthly from NCBI.

To select a data source, click the checkbox next to the data source name. All selected data source names, both nucleotide and protein, will appear in the adjacent right box. Deselecting a checkbox will remove the data source's name from the list.

The Search Field drop-down list displays the key words that may be used to search the data.

An important caution

The list of search fields is a function of the selected data sources, and not every data source includes all the available search fields. For example, if you select UniProtKB, a data source located within the Protein Sequences group, then two of the twelve available search fields, Qualifier and Feature Key, are disabled.

You perform queries to retrieve data from data sources for further processing.

Other important information

If you know the GI/accession number of the sequence(s) you want to retrieve, use Fetch Sequence Records in the Fetch Sequence(s) box on the right side of the Query page. Retrieved records will display on the Extracts page. Records retrieved using the Fetch Sequence Records functionality can be saved as a searchable data source. The name you provide will appear in the list of All data sources on the Query page.

Creating a query: find all protein records from UniProtKB related to an myb gene in Panicoideae
1. From the Query page, select the UniProtKB dataset under the Protein Sequences node in the Available Data Sources tree. When a data source is selected, its name will appear in the adjacent right box.
2. Next, go to the Query Form at the bottom of the page. Use the Search Field drop-down menu to focus your search on a specific part of a data source (such as Title, Gene or Species). To search through all of a data source's text, choose "All Text". For our example, select Gene.
3. In the Search Term box, type in the desired search item. You can type a word or a phrase such as "kinase" or "heat shock factor". Limited wildcard searches are also permitted. For our example, type myb*.
4. Click the Add Search Line button to add an additional search expression. The Boolean search expression options are AND, OR, and AND NOT. For our example, select AND.
5. From the Search Field drop-down menu, select Taxonomy.
6. In the Search Term box, type in Panicoideae.
7. Click the Submit Query button. Query results will display under the Extracts tab.
8. The query should return approximately 110 records, which are displayed on the Extracts page.
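
The wildcard term myb* in step 3 can be pictured as a pattern match against the Gene field. The tutorial does not spell out the server's exact wildcard rules, so the sketch below assumes shell-style behavior in which * stands for any run of characters and matching ignores case. The gene names are made up for illustration.

```python
from fnmatch import fnmatchcase


def matches(term, value):
    """Case-insensitive match where '*' stands for any run of characters."""
    # Assumption: the server's "limited wildcard" behaves like a shell glob.
    return fnmatchcase(value.lower(), term.lower())


# Hypothetical gene names, for illustration only.
genes = ["MYB1", "myb-related", "MYC2", "FXN"]
hits = [gene for gene in genes if matches("myb*", gene)]
print(hits)
```

Note that fnmatchcase also gives special meaning to ? and [ ], which the BioExtract Server may or may not support.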

Creating a query with Boolean expression OR: find all NCBI Nucleotide records associated with the CDH4 or C4 gene
1. Select the Nucleotide (nuccore) data source under Nucleotide Sequences / NCBI Nucleotide Databases. (If the UniProtKB data source is still selected, remove it by unchecking it in the data source tree.)
2. Using two search lines, enter: Gene = cdh4 OR Gene = C4. This will return records that mention one or the other gene name, but not necessarily both.

BioExtract Server Data Extracts

From the Extracts page, you can save results as a searchable data extract, export results in FASTA format, and view detailed data for listed records. In addition, results may be filtered and used as input into analytic tools.

- Save Extract button: The ability to save results as a data extract is available only to users who have registered with the BioExtract Server. Once saved, data extracts are listed with other available data sources on the Query tab under Miscellaneous.
- Export Records button: Download records from the current result page or from all of the results. Records download in FASTA format.
- See the number of matches found.
- Data sources searched and the number of records found in each.
- Numbered links allow you to go to any result page. First moves to the first page, and Last moves to the final page.
- Select Records button: Displays buttons and check boxes to narrow down results.
- External Link: See detailed data about the clicked record at that data source's web site.
- Local Details: Displays the file for the clicked record.
- Description: Displays a short description of the record.
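
For readers unfamiliar with the FASTA format used by Export Records, each record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below writes records in that shape. The 60-character line width is a common convention assumed here, not something the tutorial specifies, and the records are made up.

```python
import textwrap


def to_fasta(records, width=60):
    """Render (identifier, description, sequence) triples as FASTA text."""
    lines = []
    for identifier, description, sequence in records:
        # Header line: '>' plus identifier, then an optional description.
        lines.append(f">{identifier} {description}".rstrip())
        # Sequence lines, wrapped to the chosen width.
        lines.extend(textwrap.wrap(sequence, width))
    return "\n".join(lines) + "\n"


# Made-up records, for illustration only.
example = [
    ("seq1", "hypothetical protein", "MKT" * 25),
    ("seq2", "", "ACGT"),
]
print(to_fasta(example), end="")
```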

Filtering data extracts
1. Select "Nucleotide (nuccore)".
2. Enter the query: Gene = cdh4 OR Gene = C4.
3. Click the Submit Query button.
4. Click the Select Records button on the Extracts page.
5. Click the check boxes adjacent to some of the records in the result set.
6. Click the Keep Only Selected Records button.
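
The effect of the steps above can be pictured with a small in-memory example: the OR query keeps any record that satisfies either test, and Keep Only Selected Records then replaces the result list with the checked records. This is a sketch under assumptions, not the server's implementation. The records are made up, and case-insensitive matching is assumed because the tutorial searches for the CDH4 gene using "cdh4".

```python
# Hypothetical records; the identifiers and gene names are made up.
records = [
    {"id": "rec1", "gene": "CDH4"},
    {"id": "rec2", "gene": "C4"},
    {"id": "rec3", "gene": "FXN"},
    {"id": "rec4", "gene": "cdh4"},
]


def gene_is(record, name):
    """Case-insensitive equality test on the record's gene field."""
    return record["gene"].lower() == name.lower()


# Gene = cdh4 OR Gene = C4: a record qualifies if either test holds.
results = [r for r in records if gene_is(r, "cdh4") or gene_is(r, "C4")]
query_ids = [r["id"] for r in results]

# "Keep Only Selected Records": the checked records replace the result list.
selected_ids = {"rec1", "rec4"}
results = [r for r in results if r["id"] in selected_ids]
kept_ids = [r["id"] for r in results]
print(query_ids, kept_ids)
```

As in the server, the unselected records are gone from the result list after filtering; recovering them means running the query again.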

Saving data extracts
1. From the Query page, select the UniProtKB dataset under Protein Sequences in the Available Data Sources tree. When a data source is selected, its name will appear in the adjacent right box.
2. Next, go to the Query Form at the bottom of the page. Use the Search Field drop-down menu to focus your search on a specific part of a data source (such as Title, Gene or Species). To search through all of a data source's text, choose "All Text". For our example, select Gene.
3. In the Search Term box, type in the desired search item. You can type a word or a phrase such as "kinase" or "heat shock factor". Limited wildcard searches are also permitted. For our example, type myb*.
4. Click the Add Search Line button to add an additional search expression. The Boolean search expression options are AND, OR, and AND NOT. For our example, select AND.
5. From the Search Field drop-down menu, select Taxonomy.
6. In the Search Term box, type in Panicoideae.
7. Click the Submit Query button. After the query is complete, click the Save Extract button on the Extracts page. Enter Uniprot Panicoideae myb for the extract name and description.
8. Click Create Extract.
9. Under the Miscellaneous node in the Available Data Sources tree on the Query page, select the Uniprot Panicoideae myb data source and re-execute the query. Uniprot Panicoideae myb is now a searchable data source.

Other important information

All data extracts created within the BioExtract Server are privately owned by the user and are made available to others only by explicitly sharing them with a group. This is accomplished by: (i) navigating to the Groups tab; (ii) creating a group under additional actions; (iii) clicking on the new group; (iv) selecting the Extracts tab for the new group; and finally (v) clicking the Add Elements button to select the data extract to share.

Data extracts may also be created by using the Fetch Sequence Records tool on the Query page. After selecting the tool, you are asked to enter or upload a list of sequence record identifiers, such as accession numbers. You are also asked to specify the database to search, such as NCBI, EMBL or UniProt. Results will display on the Extracts page.

You can select records listed across multiple result pages before clicking the Keep Only Selected Records button. Clicking the Keep Only Selected Records button permanently updates the list of results. If you want to see your original results, you will need to re-execute the query.

Analyzing Data within the BioExtract Server

The BioExtract Server provides access to a number of bioinformatic analytic tools, the majority of which are integrated as curated web services. Users access analytic tools through the list of Available Tools on the Tools page. Tools are arranged in groups (e.g. Alignment Tools, BioMart, Nucleic Tools and Similarity Search Tools). Browse through the different groups by clicking the plus and minus signs.

Not sure which tool to use? Check the tool's help pages. To do this, choose a tool of interest. The tool's form will open in the right panel. Click More Information at the top of the tool form. A new window with the tool help will open.

The basic steps for executing tools are:
Step 1. Select a tool
Step 2. Input some data
Step 3. Define parameters
Step 4. Click Execute and wait
Step 5. View tool results

Selecting an analytic tool

From the Tools page, select the desired analytic tool in the Tools list in the left panel. For example, click Similarity Search Tools (click the name or the plus sign) and you are offered blastn, blastp, blastx, tblastn and tblastx. When you find the tool you need, click its name. The tool's form will open in the right panel. You will use this form to get more information about the tool, input your data, define parameters, execute the tool, and view tool results.

Providing input into an analytic tool

Go to the Input Data section near the top of the tool form. The BioExtract Server offers several different ways to input data:
- To input records listed on the Extracts page (such as query results and user-saved data extracts), choose Use records on Extracts page formatted as FASTA.
- To input results from an executed tool, choose Use previously executed tool results. In the associated drop-down menus, select a tool and a result file.
- To input a data file saved on your computer, choose Upload data saved on your computer. Click Browse or Choose File. In the open dialog box, find the file you wish to upload.
- If you are a registered iPlant user and want to use data stored in your iPlant Discovery Environment, click the Import a file from iPlant radio button and click the Select File button.
- To paste or type in data, choose Paste or type data into the text area and then enter your data.

Error from tools

Make sure the entered data matches the format requirements of the selected tool. To get information about a tool's format requirements, click More Information at the top of the tool form. A window with the tool help will open. Check the format requirements. If there is a format mismatch, you can use the BioExtract Server's FormatConversion tool to correct the issue.

Setting analytic tool parameters

Go to the Parameter Settings section of the form for the selected tool. You can keep the parameter settings at their defaults or change them according to your data and preferences. If you need help with parameters, click the More Information link at the top of the tool form. A window with the tool help will open.

Executing an analytic tool

Click the Execute button at the bottom of the tool form. The button will change to read Terminate and a status message will appear just below it. After successful completion, the status message will read Execution Complete.

Viewing analytic tool output files

Analytic tool output files can be found in the Tool Results drop-down menu at the bottom of the tool form. To view tool results, open this menu, select a file of interest, and then click View Results. A new window will open with the results. Result files may be viewed, downloaded, and used as input into subsequently executed analytic tools.

Output files may be saved to your iPlant Discovery Environment by clicking the Save Result button adjacent to the Tool Results menu.

Other important information

Some tools (such as the Basic Local Alignment Search Tool, BLAST) are able to turn their results into a list that displays on the Extracts page. If a tool has this ability, a pop-up displaying "Use the Tool Results dropdown menu and the Extracts page to view the results" will appear once the tool successfully executes. Records in this list can be filtered, exported, and used as input into subsequently executed analytic tools. Users who are registered and signed in to the BioExtract Server also have the option of saving these records as a data extract for future use.

Executing an analytic tool
1. Select "blastn" from the list of Similarity Search Tools in the Tools list.
2. Select the Paste or type data into the text area radio button.
3. Enter: XM_001061943
4. Click the Execute button.

Note: the BLAST tools are configured to create a result set on the Extracts page.

Viewing analytic tool results
1. Start from an executed blastn tool (see Executing an analytic tool).
2. Click the Tool Results drop-down menu at the bottom of the tool form.
3. Select the blast_results.html file.
4. Click View Results.
5. A window will open displaying the tool results.
6. Click the Download This File link in the upper left corner.
7. In the open dialog box, choose where you want to save the file.
8. To save the output to your iPlant DE, select the blast_results.html file from the drop-down list and click the Save Result button.

9. Select the desired directory in which to store the file, then click the Create File button.
10. Enter the desired file name and click OK.
11. View the results of the operation by logging in to your iPlant DE.

iPlant Collaborative Tools

The iPlant Collaborative provides access to a wide variety of biological applications that its science advisors and staff have determined to be fundamental to the science infrastructure and that directly support specific scientific objectives. These applications are installed, supported, and maintained by iPlant staff. They are deployed directly on high-performance cluster systems or high-performance VMs. Deployments of these applications are tuned for optimal performance and scalability in a collaborative effort between the primary software author and iPlant staff. These applications are discoverable and usable by all authenticated users of the iPlant Cyberinfrastructure (CI).

The BioExtract Server team has deployed additional applications to the iPlant CI and made them available to BioExtract Server researchers. Any tools that you have deployed to iPlant are also added to the list.

The BioExtract Server's iPlant interface does not differ from that of the majority of other tools in the list of tools. Select an iPlant tool by clicking its entry in the Available Tools list in the left panel. The interface for the selected tool appears in the right panel. Clicking the Execute button will execute the tool. Once execution has completed, the View Results button is enabled. Clicking it will display the output files associated with the tool execution. Clicking the name of an output file will display the contents of the file. The output from iPlant analytic tool execution is automatically stored in your iPlant Discovery Environment.

The iPlant Discovery Environment (DE) is a system of software and hardware that provides a modern web interface and platform for powerful computing, data, and application resources. The DE facilitates data exploration and scientific discovery by integrating powerful, community-recommended software tools into a system that is robust enough to handle data while utilizing high-performance computing resources such as XSEDE (formerly known as TeraGrid) and others as needed to perform these tasks much more quickly.

As a registered iPlant user, you can manage personal data on your DE platform. The Discovery Environment uses iRODS, which is also used by, and accessible from, other iPlant services. Your data is safe, easy for you to access, and not locked in to only one method.

Executing iPlant analytic tool within the BioExtract Server
1. In the Query tab, search NCBI for all Arabidopsis Argonaute proteins.
2. Optionally, click the Extracts tab to view the results of the query.

3. In the Tools tab, select the clustalo-lonestar iPlant tool. Specify that the tool should use records on the Extracts page formatted as FASTA as the input data. Select fasta for the Force sequence input file format parameter and Protein for the Force a sequence type parameter.
4. Next, click Execute. The tool may take two to three minutes to complete.
5. After execution has completed, the View Results button becomes enabled. View output.txt.

Using an iPlant Discovery Environment (DE) File as Input
1. On the Query tab, search NCBI for all Arabidopsis Argonaute proteins.
2. On the Tools tab, select the Fetch Sequence Records tool under the Information Tools node. Specify that the tool should Use records on the Extracts page as input, and set the database parameter to ncbi.
3. After execution has completed, the View Results and Save Result buttons become enabled.

4. To save the output to your iPlant DE, select the result.txt file from the drop-down list and click the Save Result button. Select the desired directory in which to store the file, and then click the Create File button.
5. Enter Aronaute_Arabidopsis.fa for the file name and click OK.
6. In the Tools tab, select the clustalo-lonestar (Clustal Omega running on Lonestar) tool under the iPlant node.

7. Next, select the Import a file from iPlant radio button and click the adjacent Select File button. Select the 'Aronaute_Arabidopsis.fa' file from your iPlant DE.
8. Before executing the tool, verify that Force sequence input file format is set to fasta and Force a sequence type is set to Protein. Click the Execute button.
9. Results of the execution of iPlant tools are automatically saved in your iPlant DE.

BioExtract Server Workflows

The BioExtract Server workflow system gives users the ability to save a series of BioExtract Server tasks (e.g. querying a data source, saving a data extract and executing analytic tools) as a workflow. Any series of these tasks performed on data is a possible candidate for workflow automation.

The BioExtract Server's most distinctive feature may be its ability to record your tasks as you perform them and then turn those tasks into a workflow. This means that you create a workflow simply by performing the tasks in your analysis job. As you perform the tasks, the BioExtract Server records each one as a step. Once you have finished all of your tasks, provide a name for your workflow and click Save. The BioExtract Server connects the steps together to form a complete workflow represented as a directed acyclic graph. You can execute the workflow as one unit, or run each step individually. A detailed report can also be generated for personal review, for publishing, or for sharing with colleagues.

As a guest, you can study and run the BioExtract Server's public workflows. As a registered user, you are able to create, modify and share workflows with colleagues.

Preparing to record a workflow
1. Log in to the BioExtract Server.
2. Navigate to the Workflow page.
3. Click the Create and Import Workflows node in the workflow list on the left.
4. Click the Record Workflow button.
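
To illustrate what representing a workflow as a directed acyclic graph means in practice, the sketch below orders a set of steps so that every step runs after the steps whose output it consumes. It uses Python's standard graphlib module (Python 3.9 or later) and is not the BioExtract Server's own scheduler. The step names and dependencies are hypothetical labels loosely based on the tasks in this tutorial.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it consumes.
# Step names and dependencies are hypothetical, for illustration only.
workflow = {
    "query": set(),
    "showalign": {"query"},
    "blastp": {"query"},
    "tcoffee": {"blastp"},
    "clustalo": {"tcoffee"},
}

# static_order() yields one valid execution order; it raises
# graphlib.CycleError if the steps depend on each other in a loop.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

More than one valid order may exist (showalign can run at any point after query), which is why independent steps in a workflow do not have to wait for one another.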

Creating a workflow

Context: The FXN gene provides instructions for making a protein called frataxin. This protein is found in cells throughout the body, with the highest levels in the heart, spinal cord, liver, pancreas, and the muscles used for voluntary movement (skeletal muscles). Within cells, frataxin is found in energy-producing structures called mitochondria. Although its function is not fully understood, frataxin appears to help assemble clusters of iron and sulfur molecules that are critical for the function of many proteins, including those needed for energy production. Mutations in the FXN gene cause Friedreich ataxia, a genetic condition that affects the nervous system and causes movement problems. Most people with Friedreich ataxia begin to experience the signs and symptoms of the disorder around puberty.

Hypothesis: Reduced expression of frataxin is the cause of Friedreich's ataxia (FRDA), a lethal neurodegenerative disease. What about liver cancer?

Aim: The purpose of this lab is to introduce online tools for exploring large-scale human data (metabolic, proteomic, genomic, ...). We simulate the application using the FXN gene and pancreatic cancer. The exercise shows how a researcher can identify and cross-reference biological knowledge available in data banks.

1. Select the Query tab. Select Protein Sequences and check the box next to NCBI protein database. Select gene as the Search field and enter FXN as the search term. Click Add Search Line, select Species as the Search field, and enter Human as the search term. Click Add Search Line again, select AND NOT, select Definition as the Search field, and enter Full=Frataxin as the search term. Click the Submit Query button.

2. Click the Tools tab, then click Alignment Tools and select showalign. Select Use records on extract page formatted in FASTA. Click Execute to run the tool.
3. When execution is complete, retrieve the results by selecting the desired format and clicking View Results.

4. Select Similarity Search Tools, then select blastp. Select Use records on extract page formatted as FASTA.
5. In the Choose search set parameter section, set the database (DATABASE) to swissprot. Under Formatting Options, set the maximum number of sequences (MAX_NUM_SEQ) to 10.
6. After the tool has executed, the resulting records can be viewed on the Extracts page.

7. To perform a multiple sequence alignment on the similar sequences, execute TCoffee under the Alignment Tools node in the Tools list. Specify that the input should use the records on the Extracts page.
8. After execution has completed, the results may be viewed.

9. To perform a multiple sequence alignment on the similar sequences, execute Muscle under the iPlant node in the Tools list. Select Use previously executed tool results for input, then select TCoffee and sequence.fasta.
10. After execution has completed, the results may be viewed.

11. Go to the Tools tab again, select iPlant, then select clustalo-lonestar. Select Use previously executed tool results for input, then select TCoffee and sequence.fasta. Your protein sequences will automatically be supplied as input to the Clustal Omega [1] tool. Make certain that you set Force sequence input file format to fasta and Force a sequence type to Protein. Execute the tool.
12. After execution has completed, use the Tool Results pull-down to select output.txt before viewing the results.
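As a rough picture of how steps 1 through 12 fit together, the tools can be written down as a dependency graph. The sketch below uses Python's standard `graphlib` module. The dependency table is our reading of which input each step selects in this tutorial; it is an assumption, not something exported from the BioExtract Server, and the `prerequisites` helper is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps a step to the set of steps whose output it consumes.
# Assumed from the inputs chosen in the tutorial steps.
dependencies = {
    "query": set(),
    "showalign": {"query"},
    "blastp": {"query"},
    "TCoffee": {"blastp"},
    "Muscle": {"TCoffee"},
    "clustalo-lonestar": {"TCoffee"},
}

# static_order() yields every step after all of its dependencies and
# raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(dependencies).static_order())
print(order)


def prerequisites(step):
    """Return every step that must have run before `step` can run."""
    seen = set()
    stack = list(dependencies[step])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependencies[current])
    return seen


print(sorted(prerequisites("Muscle")))
```

Several orderings are valid for this graph (for example, showalign may come before or after blastp), so the sketch only guarantees that each step appears after the steps it depends on.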

Saving and executing a workflow

(Note: this section should directly follow Creating a workflow.)

1. Go back to the Workflow tab and click Create and Import Workflows. Enter a name and a description for your workflow, then click Save. All of the previous steps will be saved in this workflow.
2. Once the workflow has been saved, it is listed with the other workflows on the left. Click the name of the workflow to see a schematic view of it.
3. Run the workflow by clicking Start.

4. After a process in the workflow has completed (its color turns green), you can view the results by right-clicking the process and selecting More Information.
5. General information about the process is displayed. The inputs and outputs for the process can be viewed or saved by clicking View File.

Viewing workflow provenance information

1. Once the workflow has finished executing, you can view the provenance report by clicking the Provenance button at the bottom of the screen.
2. General information about the workflow is displayed. The inputs and outputs for each process can be viewed or saved by clicking View File next to its name. The report also records information such as parameter settings, the date the workflow was created, and the workflow description.
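To make the contents of a provenance report concrete, the sketch below models the kinds of information listed above (description, creation date, and per-process parameters, inputs, and outputs). The class names, field names, and text layout are hypothetical, and the description and date are placeholder values; the actual report format of the BioExtract Server may differ.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProcessRecord:
    """Provenance for one process in the workflow (hypothetical layout)."""
    name: str
    parameters: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


@dataclass
class ProvenanceReport:
    workflow_name: str
    description: str
    created: date
    processes: list = field(default_factory=list)

    def to_text(self):
        lines = [
            f"Workflow: {self.workflow_name}",
            f"Description: {self.description}",
            f"Created: {self.created.isoformat()}",
        ]
        for p in self.processes:
            lines.append(f"Process: {p.name}")
            for key, value in sorted(p.parameters.items()):
                lines.append(f"  {key} = {value}")
            lines.append(f"  inputs: {', '.join(p.inputs) or 'none'}")
            lines.append(f"  outputs: {', '.join(p.outputs) or 'none'}")
        return "\n".join(lines)


report = ProvenanceReport(
    workflow_name="Alignment and Similarity",
    description="placeholder description",  # placeholder, not real data
    created=date(2000, 1, 1),               # placeholder date
    processes=[
        ProcessRecord(
            name="blastp",
            parameters={"DATABASE": "swissprot", "MAX_NUM_SEQ": 10},
        ),
        ProcessRecord(name="TCoffee", outputs=["sequence.fasta"]),
    ],
)
print(report.to_text())
```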

Modifying a workflow

1. Go back to the Workflow tab and expand the Alignment and Similarity workflow.
2. Expand the Query process, modify the query to search for the wcag gene in Salmonella Typhimurium as shown, and click the Save button.

   common:gene=wcag AND common:species=salmonella AND common:defn='typhimurium'

3. Run the workflow by clicking Start.
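The query string in step 2 is a list of field=value terms joined by AND. As an aid to reading or checking such strings before pasting them into the Query process, here is a minimal parsing sketch. It is not part of the BioExtract Server: the `parse_query` function is hypothetical, and it handles only plain AND-joined terms like the one shown (no OR, AND NOT, or quoted values that themselves contain " AND ").

```python
import re

# A term is a common:<field> name, "=", then a quoted or unquoted value.
TERM = re.compile(r"(common:\w+)=('[^']*'|\S+)")


def parse_query(query):
    """Split an AND-joined query into (field, value) pairs."""
    terms = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        match = TERM.fullmatch(part)
        if match is None:
            raise ValueError(f"cannot parse term: {part!r}")
        field_name, value = match.groups()
        terms.append((field_name, value.strip("'")))
    return terms


query = ("common:gene=wcag AND common:species=salmonella "
         "AND common:defn='typhimurium'")
for field_name, value in parse_query(query):
    print(field_name, value)
```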

The following table maps the GUI search fields (i.e., the values that appear in the Search Field drop-down box on the Query page) to the common search fields.

GUI Search Field    Common search field
All Text            common:all
Id                  common:id
Author              common:author
Title               common:title
Accession           common:accn
Definition          common:defn
Feature Key         common:fkey
Gene                common:gene
Keywords            common:keyword
Species             common:species
Taxonomy            common:taxonomy
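The mapping in the table above can also be kept as a lookup table when composing query strings by hand. The sketch below is illustrative only: the `build_query` helper is hypothetical, it joins terms with AND in the same form as the modified query shown earlier, and it does not quote values or support other Boolean operators.

```python
# GUI search field -> common search field, copied from the table.
GUI_TO_COMMON = {
    "All Text": "common:all",
    "Id": "common:id",
    "Author": "common:author",
    "Title": "common:title",
    "Accession": "common:accn",
    "Definition": "common:defn",
    "Feature Key": "common:fkey",
    "Gene": "common:gene",
    "Keywords": "common:keyword",
    "Species": "common:species",
    "Taxonomy": "common:taxonomy",
}


def build_query(terms):
    """Join (GUI field, value) pairs with AND using common field names."""
    parts = []
    for gui_field, value in terms:
        if gui_field not in GUI_TO_COMMON:
            raise ValueError(f"unknown GUI search field: {gui_field!r}")
        parts.append(f"{GUI_TO_COMMON[gui_field]}={value}")
    return " AND ".join(parts)


print(build_query([("Gene", "wcag"), ("Species", "salmonella")]))
# prints: common:gene=wcag AND common:species=salmonella
```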

References

[1] F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Soding, J. D. Thompson, and D. G. Higgins, "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega," Mol Syst Biol, vol. 7, p. 539, 2011.
[2] C. M. Lushbough, D. M. Jennewein, and V. P. Brendel, "The BioExtract Server: a web-based bioinformatic workflow platform," Nucleic Acids Res, vol. 39, pp. W528-32, Jul 2011.
[3] S. A. Goff, M. Vaughn, S. McKay, E. Lyons, A. E. Stapleton, D. Gessler, N. Matasci, L. Wang, M. Hanlon, A. Lenards, A. Muir, N. Merchant, S. Lowry, S. Mock, M. Helmke, A. Kubach, M. Narro, N. Hopkins, D. Micklos, U. Hilgert, M. Gonzales, C. Jordan, E. Skidmore, R. Dooley, J. Cazes, R. McLay, Z. Lu, S. Pasternak, L. Koesterke, W. H. Piel, R. Grene, C. Noutsos, K. Gendler, X. Feng, C. Tang, M. Lent, S. J. Kim, K. Kvilekval, B. S. Manjunath, V. Tannen, A. Stamatakis, M. Sanderson, S. M. Welch, K. A. Cranston, P. Soltis, D. Soltis, B. O'Meara, C. Ane, T. Brutnell, D. J. Kliebenstein, J. W. White, J. Leebens-Mack, M. J. Donoghue, E. P. Spalding, T. J. Vision, C. R. Myers, D. Lowenthal, B. J. Enquist, B. Boyle, A. Akoglu, G. Andrews, S. Ram, D. Ware, L. Stein, and D. Stanzione, "The iPlant Collaborative: Cyberinfrastructure for Plant Biology," Front Plant Sci, vol. 2, p. 34, 2011.
[4] S. L. Oliver, A. J. Lenards, R. A. Barthelson, N. Merchant, and S. J. McKay, "Using the iPlant Collaborative Discovery Environment," Curr Protoc Bioinformatics, vol. Chapter 1, p. Unit1 22, Jun 2013.
[5] P. Rice, I. Longden, and A. Bleasby, "EMBOSS: the European Molecular Biology Open Software Suite," Trends Genet, vol. 16, pp. 276-7, Jun 2000.
[6] A. Kasprzyk, "BioMart: driving a paradigm change in biological data management," Database (Oxford), vol. 2011, p. bar049, 2011.
[7] D. Mrozek, B. Malysiak-Mrozek, and A. Siaznik, "search GenBank: interactive orchestration and ad-hoc choreography of Web services in the exploration of the biomedical resources of the National Center For Biotechnology Information," BMC Bioinformatics, vol. 14, p. 73, 2013.
[8] M. Magrane and U. Consortium, "UniProt Knowledgebase: a hub of integrated protein data," Database (Oxford), vol. 2011, p. bar009, 2011.