OrthoMCL v1.4 Datadoc v.1 1/29/2007

1. Algorithm Description (SCIENCE)

Summary: OrthoMCL is a method that finds the closest relative of a gene within the gene set of another species. For example, protein kinase A in Mycobacterium avium has an evolutionary relative in Mycobacterium tuberculosis, and the program will find that gene even if the relationship has not been described in the literature. OrthoMCL was originally designed as a pipeline in which all data are stored in a GUS relational database (Genomic Unified Schema; (Davidson, Crabtree et al. 2001)), and that implementation used many MySQL queries to retrieve BLAST data. To allow ortholog clustering without depositing data into a GUS database, a stand-alone version of OrthoMCL was developed as a Perl package. That stand-alone version is the one described and used here.

Scientific Basis: OrthoMCL operates on the basis of pairwise gene homology. To a first approximation, genes with similar sequences also have similar metabolic function. OrthoMCL compares all possible gene pairs between two organisms and decides which is the best match on the basis of reciprocal best BLAST hit.

Data supplied: Ortholog sets, as tab-delimited text.

Target Organisms: All genes from any kingdom (Bacteria, Virus, and Eukaryote). Run a separate job for each genus, e.g. a Francisella run, an Encephalitozoon run, a Mycobacterium run, an Influenza run.

Precision: Unknown. Most predictions are made on completely new gene sequences, and verifying each data point requires a separate lab experiment.

Recall: 81.6% of genes are grouped from the test organisms (16 bacterial, 4 archaeal genomes, 12 animals, 9 fungi, 1 each microsporidium, Dictyostelium, Entamoeba, 4 plants/algae and 7 apicomplexan parasites). The remaining genes are not clustered into groups. Organisms have disparate numbers of genes, so a complete matrix of orthologs is not possible by this or any method.

Scoring: None.

Files Supplied: 2: Raw data file; Parsed data file for DB loading.

Data Structure: Delimited text.

Platform: Perl scripts and Markov libraries. Platform agnostic.

References:
Li Li, Christian J. Stoeckert, Jr., and David S. Roos (2003) OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 13:
Feng Chen, Aaron J. Mackey, Christian J. Stoeckert, Jr., and David S. Roos (2006) OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 34: D
Feng Chen, Aaron J. Mackey, Jeroen K. Vermunt, and David S. Roos (2007) Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes. PLoS ONE 2(4): e383.

Web Service:

Software: None

Supplemental Docs:

Contact: orthomcl@pcbi.upenn.edu
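The reciprocal best BLAST hit criterion named under Scientific Basis can be illustrated with a short Perl sketch. It is not OrthoMCL's own code, and it covers only the reciprocal best hit step, not the Markov clustering. The gene names, species labels, e-values and the five-field row layout are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical BLAST hits:
# [query gene, query species, subject gene, subject species, e-value].
# Every identifier and value here is made up for illustration.
my @hits = (
    [ 'avium_pkA', 'M_avium',        'tb_pkA',    'M_tuberculosis', 1e-80 ],
    [ 'avium_pkA', 'M_avium',        'tb_pkB',    'M_tuberculosis', 1e-20 ],
    [ 'tb_pkA',    'M_tuberculosis', 'avium_pkA', 'M_avium',        1e-78 ],
    [ 'tb_pkB',    'M_tuberculosis', 'avium_pkA', 'M_avium',        1e-18 ],
);

# For each query gene, keep the lowest e-value hit in every other species.
# Returns a hash reference: $best->{query}{subject species} = [subject, e-value].
sub best_hits {
    my (@rows) = @_;
    my %best;
    for my $row (@rows) {
        my ($query, $qsp, $subject, $ssp, $evalue) = @$row;
        next if $qsp eq $ssp;    # only cross-species hits count
        my $current = $best{$query}{$ssp};
        if (!defined $current || $evalue < $current->[1]) {
            $best{$query}{$ssp} = [ $subject, $evalue ];
        }
    }
    return \%best;
}

# Return sorted "geneA<->geneB" strings for genes that are each other's best hit.
sub reciprocal_best_hits {
    my (@rows) = @_;
    my $best       = best_hits(@rows);
    my %species_of = map { ($_->[0] => $_->[1]) } @rows;
    my %pairs;
    for my $gene (keys %$best) {
        for my $ssp (keys %{ $best->{$gene} }) {
            my $partner = $best->{$gene}{$ssp}[0];
            next unless exists $best->{$partner};
            my $back = $best->{$partner}{ $species_of{$gene} };
            next unless defined $back && $back->[0] eq $gene;
            $pairs{ join '<->', sort { $a cmp $b } ($gene, $partner) } = 1;
        }
    }
    return sort keys %pairs;
}

print "$_\n" for reciprocal_best_hits(@hits);
```

In this made-up data, tb_pkB points at avium_pkA, but avium_pkA's best hit is tb_pkA, so only the avium_pkA/tb_pkA pair is reported.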

2. Data Description: (TECHNICAL)

Mods to Algorithm Code: None; new post parsers (Vecna) for Muscle runs and DB inserts.

Raw results:

ORTHOMCL656(6 genes,6 taxa): (Franc_OSU18) (Franc_U112) (Franc_FTW_WY96) (Franc_FTA) (Franc_Schu4) (Franc_LVS_holarc)
ORTHOMCL657(6 genes,6 taxa): (Franc_OSU18) (Franc_U112) (Franc_FTW_WY96) (Franc_FTA) (Franc_Schu4) (Franc_LVS_holarc)
ORTHOMCL658(6 genes,6 taxa): (Franc_OSU18) (Franc_U112) (Franc_FTW_WY96) (Franc_FTA) (Franc_Schu4) (Franc_LVS_holarc)
ORTHOMCL659(6 genes,6 taxa): (Franc_OSU18) (Franc_U112) (Franc_FTW_WY96) (Franc_FTA) (Franc_Schu4) (Franc_LVS_holarc)
ORTHOMCL660(6 genes,6 taxa): (Franc_OSU18) (Franc_U112) (Franc_FTW_WY96) (Franc_FTA) (Franc_Schu4) (Franc_LVS_holarc)

Two Perl scripts are used to parse the raw output. The first, reformat_orthomcl_results.pl, generates the post-processed data for the BRCWarehouse. The second, reformat_orthomcl_results-2.pl, generates post-processed data for other data processing pipelines.

Parsed results 1: (What will be loaded into the BRCWarehouse)
Output of reformat_orthomcl_results.pl, one line per gene, each beginning with the genus:

Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella
Francisella

Parsed results 2: (For generating other bioinformatics content via MUSCLE shell script.)
Output of reformat_orthomcl_results-2.pl:

group
group
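Each raw result line declares a gene count and a taxa count in its header. As a rough illustration of a sanity check that could be run before parsing, the Perl sketch below compares those declared counts with the members actually listed. It assumes each member has the form GENE_ID(TAXON); the gene identifiers geneA and geneB are hypothetical, and the helper names are not part of OrthoMCL.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one raw OrthoMCL group line into its group number, declared counts
# and members. Returns nothing when the line is not a group line.
sub parse_group_line {
    my ($line) = @_;
    chomp $line;
    my ($num, $ngenes, $ntaxa, $rest) =
        $line =~ /^ORTHOMCL(\d+)\((\d+) genes,\s*(\d+) taxa\):\s*(.*)$/
        or return;
    my @members;
    for my $token (split /\s+/, $rest) {
        # Each member is assumed to look like GENE_ID(TAXON).
        next unless $token =~ /^(.+)\(([^()]+)\)$/;
        push @members, { gene => $1, taxon => $2 };
    }
    return { group => $num, genes => $ngenes, taxa => $ntaxa,
             members => \@members };
}

# True when the declared gene and taxa counts match the listed members.
sub counts_are_consistent {
    my ($group) = @_;
    my %taxa = map { ($_->{taxon} => 1) } @{ $group->{members} };
    return @{ $group->{members} } == $group->{genes}
        && keys(%taxa) == $group->{taxa};
}

# Hypothetical example line; geneA and geneB are placeholder identifiers.
my $line  = 'ORTHOMCL656(2 genes,2 taxa): geneA(Franc_OSU18) geneB(Franc_U112)';
my $group = parse_group_line($line);
printf "group %d consistent: %s\n",
    $group->{group}, counts_are_consistent($group) ? 'yes' : 'no';
```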

3. SOP

BioHealthBase STANDARD OPERATING PROCEDURE

TITLE: OrthoMCL
Original Issue: 11/29/2007
Revision Date: 1/20/2007
Pages: 4
Prepared By: clarsen, tbriggs
SOP ID: BHB:SOP0007: OrthoMCL

Summary: This method allows a user to compute the relatives of any gene in question.

Definitions:
Ortholog: Closest functional homologous relative of a gene from another species
MCL: Markov Clustering Algorithm

Interferences: Significant runtime against large genomes (Francisella 2 days; Mycobacterium 8 days). Moderately high % CPU usage and multiple direct blastp calls.

Procedure in Brief:
1. Collect the proteome files for processing (FASTA protein)
2. Run OrthoMCL
3. Post-process the raw file for database loading
4. Post-process the raw file for MUSCLE MSA use

Usage for step 2:

    $ ./orthomcl.pl --mode 1 --fa_files Ath.fa,Hsa.fa,Sce.fa

NOTE: Do not put spaces between the strain file names.

Data Management: Run the algorithm, parse the data, load the data into the DB warehouse. Rerun whenever a new genome is added to an organism set (genus). The data are relative and codependent and cannot be computed independently from strain to strain. For example, when a 17th Mycobacterium strain is run, the data for all 16 existing strains must be discarded and replaced with whole-genus data that includes the new 17th proteome.

QA: Post Parsing: Visual inspection of the document for correct pathway identifiers (GO ID) within one orthology group. Run the raw output summary through a Perl-based parser to convert the data into a loadable format. Pass the genus as an argument to the run. Use only one genus (Francisella or Mycobacterium):

    perl reformat_orthomcl_results.pl /opt/orthomclv1.4/oct12/all_orthomcl.out Francisella >francisella_orthomcl.out
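Because the --fa_files value must be a comma-separated list with no spaces, it may help to build it programmatically when a genus has many strain files. The following Perl sketch is one possible way to do so. The helper name fa_files_argument is hypothetical, and the sketch only prints the command rather than running it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join proteome file names into the comma-separated value used with --fa_files.
# Dies on an empty name or a name containing whitespace or a comma, since
# either would corrupt the list.
sub fa_files_argument {
    my (@files) = @_;
    die "No FASTA files given\n" unless @files;
    for my $file (@files) {
        die "Invalid file name '$file'\n" if $file eq '' || $file =~ /[\s,]/;
    }
    return join ',', @files;
}

my $arg = fa_files_argument('Ath.fa', 'Hsa.fa', 'Sce.fa');
print "./orthomcl.pl --mode 1 --fa_files $arg\n";
```

Rejecting bad names up front is a design choice made for this sketch: it fails before a multi-day run starts rather than partway through.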

Parser 1: For Content Loading to Database

    #!/usr/bin/perl
    use warnings;
    use strict;

    # This is a utility that reformats output files from OrthoMCL's normal
    # output format (from, e.g., "all_orthomcl.out") to a format used by
    # Northrop Grumman.
    #
    # The input format looks like this:
    #
    #   ORTHOMCL0(248 genes,6 taxa): (Franc_FSC198) (Franc_FSC198) (Franc_FSC198) (Franc_FSC198)
    #
    # Each ortholog becomes one output line of the form
    # "GENUS GROUPNUM GI", so every output line begins with "Francisella".
    #
    # The "Francisella" is the organism's genus, which isn't in the input,
    # and so has to be passed on the command line.

    if (scalar(@ARGV) != 2) {
        print "Usage: reformat-orthomcl-results.pl FILENAME GENUS\n";
        print "where\n";
        print "- FILENAME is the name of the file to reformat. (The input file should be in\n";
        print "OrthoMCL's output format.)\n";
        print "- GENUS is the genus of the organism under consideration (e.g. \"Francisella\").\n";
        print "This is part of the output file.\n";
        exit();
    }

    my ($FILENAME, $GENUS) = @ARGV;

    open(INFILE, '<', $FILENAME) or die "Cannot open $FILENAME: $!\n";

    # For each line in the input file (i.e. each ortholog group)...
    while (my $line = <INFILE>) {
        # First, get the ortholog group number.
        my ($groupstr, $rest) = split(/:/, $line, 2);
        next unless defined $rest && $groupstr =~ /^ORTHOMCL(\d+)/;
        my $groupnum = $1;

        # Then, split the list of orthologs by whitespace...
        my @GIStringsForGroup = split(/\s+/, $rest);

        # ... and for each ortholog, verify that it matches a FOO123(BLAH)
        # format. If it does, push it into "GIs" (a list of the validated
        # GIs or ORFs for this line).
        my @GIs = ();
        foreach my $gistr (@GIStringsForGroup) {
            if ($gistr =~ /^(.+)\(.+\)$/) {
                push(@GIs, $1);
            }
        }

        # OK, done this line (i.e. this ortholog group). Print the
        # output-formatted lines.
        foreach my $gi (@GIs) {
            print "$GENUS $groupnum $gi\n";
        }
    }

    close(INFILE);

Parser 2: For input into Ortholog group MUSCLE jobs

This parser is used to fork off a new job for each line of content, to get the MSA of each ortholog group. The output of this process is an input file for MUSCLE. The result is a complete orthology set alignment.

    #!/usr/bin/perl
    use warnings;
    use strict;

    # This is a utility that reformats output files from OrthoMCL's normal
    # output format (from, e.g., "all_orthomcl.out") to a format used by
    # Northrop Grumman.
    #
    # The input format looks like this:
    #
    #   ORTHOMCL0(248 genes,6 taxa): (Franc_FSC198) (Franc_FSC198) (Franc_FSC198) (Franc_FSC198) [etc...]
    #   ORTHOMCL1(244 genes,6 taxa): (Franc_FSC198) (Franc_FSC198) (Franc_FSC198) [etc...]
    #
    # Which would be transformed into a "group0" line followed by one line
    # per ortholog, then a "group1" line followed by one line per ortholog.

    if (scalar(@ARGV) != 1) {
        print "Usage: reformat-orthomcl-results-2.pl FILENAME\n";
        print "where\n";
        print "- FILENAME is the name of the file to reformat. (The input file should be in\n";
        print "OrthoMCL's output format.)\n";
        exit();
    }

    my ($FILENAME) = @ARGV;

    open(INFILE, '<', $FILENAME) or die "Cannot open $FILENAME: $!\n";

    # For each line in the input file (i.e. each ortholog group)...
    while (my $line = <INFILE>) {
        # First, get the ortholog group number.
        my ($groupstr, $rest) = split(/:/, $line, 2);
        next unless defined $rest && $groupstr =~ /^ORTHOMCL(\d+)/;
        my $groupnum = $1;

        # Then, split the list of orthologs by whitespace...
        my @GIStringsForGroup = split(/\s+/, $rest);

        # ... and for each ortholog, verify that it matches a FOO123(BLAH)
        # format. If it does, push it into "GIs" (a list of the validated
        # GIs or ORFs for this line).
        my @GIs = ();
        foreach my $gistr (@GIStringsForGroup) {
            if ($gistr =~ /^(.+)\(.+\)$/) {
                push(@GIs, $1);
            }
        }

        # OK, done this line (i.e. this ortholog group). Print the
        # output-formatted lines.
        print "group$groupnum\n";
        foreach my $gi (@GIs) {
            print "$gi\n";
        }
    }

    close(INFILE);
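A downstream step that prepares one MUSCLE job per ortholog group first has to split the Parser 2 output back into groups. The Perl sketch below shows one way to read that layout, assuming each group starts with a "groupN" line followed by one identifier per line. The identifiers geneA, geneB and geneC are hypothetical, and the sketch stops short of fetching sequences or calling MUSCLE.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect gene identifiers under each "groupN" header line.
# Returns a hash reference: group number => array reference of identifiers.
sub read_groups {
    my (@lines) = @_;
    my (%groups, $current);
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        if ($line =~ /^group(\d+)$/) {
            $current = $1;
            $groups{$current} = [];
        }
        elsif (defined $current) {
            push @{ $groups{$current} }, $line;
        }
    }
    return \%groups;
}

# Hypothetical identifiers, used only to show the layout.
my @example = ("group0\n", "geneA\n", "geneB\n", "group1\n", "geneC\n");
my $groups  = read_groups(@example);
for my $num (sort { $a <=> $b } keys %$groups) {
    printf "group %d: %s\n", $num, join(' ', @{ $groups->{$num} });
}
```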

Linux Install:

1. Installation of required software and Perl modules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OrthoMCL is a Perl script and does not need compilation. However, it requires some software and Perl modules to run, as listed below:

Software:
1. BLAST (NCBI-BLAST, WU-BLAST, etc.)
*2. MCL (Markov Clustering algorithm)
NOTE: MCL recently changed its output format, and the new format is not compatible with OrthoMCL. Please use the MCL version enclosed with this package, which has been the default for all test analyses.

Perl (5.8.8 or later)

Perl Modules:
1. Bio::SearchIO (part of BioPerl)
2. Storable

2. Setting the variables in "OrthoMCL/orthomcl_module.pm"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most global variables used in "OrthoMCL/orthomcl.pl" need to be set in the Perl module "OrthoMCL/orthomcl_module.pm".

--- $PATH_TO_ORTHOMCL: the orthomcl directory itself
    (example: $PATH_TO_ORTHOMCL = "/disk3/fengchen/orthomcl/";)
--- $BLASTALL: your BLAST software
    (example: $BLASTALL = "/genomics/share/bin/blastall";)
--- $BLAST_FORMAT: how the BLAST result is stored
    Options: a) "compact" corresponds to NCBI-BLAST's -m 8
             b) "full" corresponds to NCBI-BLAST's -m 0
    For WU-BLAST, make changes in the subroutine executeblastall.
--- $BLAST_NOCPU: the number of CPUs
    On a multi-processor machine, setting it higher than 1 will save significant time in the BLAST step.
--- $FORMATDB: your FORMATDB software
    (example: $FORMATDB = "/genomics/share/bin/formatdb";)
--- $MCL: your MCL software
    (example: $MCL = "/disk2/fengchen/mcl /shmcl/mcl";)
--- $MAX_WEIGHT_DEFAULT: weight used for protein pairs whose BLAST p-value is zero (0). This depends on the algorithm you use: if the second smallest p-value is on the order of 1e-99, maximum_weight should be 100; if 1e-299, maximum_weight should be 300 <DEFAULT>.

Once the variables $PATH_TO_ORTHOMCL, $BLASTALL, $FORMATDB and $MCL are set, you can run orthomcl.pl on a three-species test set.

    % orthomcl.pl --mode 1 --fa_files "Ath.fa,Hsa.fa,Sce.fa"

Note: The test files Ath.fa, Hsa.fa and Sce.fa contain only 15, 16 and 11 sequences, respectively. This test set was chosen to confirm that everything is set up and that OrthoMCL can run on your machine. Because OrthoMCL takes a long time to cluster a big data set, from BLAST to MCL, it is wise to try a very small set first.

To use OrthoMCL on your own data, collect protein FASTA files ".fa" (each ".fa" file representing one species only and having a simple name, e.g. "Eco.fa") and put them in the directory "data", or reset the following variable:

--- $ORTHOMCL_DATA_DIR: the data directory to store the fasta files

    $ORTHOMCL_DATA_DIR = $PATH_TO_ORTHOMCL."/data/"; (DEFAULT)

3. Running OrthoMCL
~~~~~~~~~~~~~~~~~~~

In principle, the COMPLETE proteome of each species should be used. You should also have enough memory (>=800MB) if you have around 100,000 sequences to cluster, because this stand-alone version tries to read the BLAST information into memory.

There are five modes for running OrthoMCL, each following a different process. We strongly suggest MODE 4 for very big data sets, since the BLAST step was not programmed to run in parallel. For mode 4 you only need to prepare two files, a BPO file and a GG file. It is also very fast: our test set of 200,000 sequences took 8 hours to finish on a Mac G5 computer.

The five modes of OrthoMCL are:

Mode 1: OrthoMCL analysis from FASTA files. OrthoMCL runs from the initial BLAST through to the final MCL.
    Example: % orthomcl.pl --mode 1 --fa_files Ath.fa,Hsa.fa,Sce.fa

Mode 2: OrthoMCL analysis based on a former OrthoMCL run (the former run directory must be given). Use it to change the inflation parameter, the p-value cutoff (which can only be lower than the BLAST p-value cutoff of the former run), the percent identity cutoff or the percent match cutoff. No BLAST or BLAST parsing is performed.
    Example: % orthomcl.pl --mode 2 --former_run_dir Sep_8 --inflation 1.4

Mode 3: OrthoMCL analysis from a user-provided BLAST output file and a genome gene relation file telling which genome has which gene (please refer to 5. File Formats). No BLAST is performed.
    Example: % orthomcl.pl --mode 3 --blast_file AtCeHs_blast.out --gg_file AtCeHs.gg

Mode 4: OrthoMCL analysis from a user-provided BPO (BLAST PARSING OUT) file and a GG (genome gene relation) file telling which genome has which gene (please refer to 5. File Formats). No BLAST or BLAST parsing is performed.
    Example: % orthomcl.pl --mode 4 --bpo_file AtCeHs.bpo --gg_file AtCeHs.gg

Mode 5: OrthoMCL analysis based on a previous run, but with fewer taxa included or with only the inflation value changed (FASTER than mode 2; no selection of reciprocal best/better hits is performed).
    Example: % orthomcl.pl --mode 5 --former_run_dir Sep_8 --taxa_file AtCeHs.gg --inflation=1.1
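Since each mode expects a different set of input options, a wrapper script could check them before launching a long run. The Perl sketch below is one possible pre-flight check and is not part of OrthoMCL. The table of required options is inferred from the example commands above, and it assumes --taxa_file is optional in mode 5 because that mode can also be used with only the inflation value changed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Options each mode appears to need, inferred from the example commands.
# This table is an assumption of this sketch, not taken from orthomcl.pl.
my %required = (
    1 => ['fa_files'],
    2 => ['former_run_dir'],
    3 => ['blast_file', 'gg_file'],
    4 => ['bpo_file', 'gg_file'],
    5 => ['former_run_dir'],
);

# Return the names of required options that are missing for the given mode.
sub missing_options {
    my ($mode, %opts) = @_;
    die "Unknown mode '$mode'\n" unless exists $required{$mode};
    return grep { !defined $opts{$_} } @{ $required{$mode} };
}

my @missing = missing_options(4, bpo_file => 'AtCeHs.bpo');
print "Mode 4 is missing: @missing\n";
```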
