diamond


Rolf Marsh
5 years ago

Diamond is a sequence database searching program with the same function as BlastX, but 1000X faster. A whole transcriptome search of the NCBI nr database, for instance, may take weeks using BlastX, but can be completed in hours with Diamond. Highly recommended for transcriptome analysis.

Requirements

1. Reference database. I suggest uniref90.
   1. But use nr for MEGAN; this choice is handled in the third example.
2. Query file(s) in fasta or fastq format. All query sequences should be in a single file.

Time

Diamond is very fast. Large searches:

- trinity assembly with 5.1 M predicted transcripts vs ncbi nr: cpu hrs on snyder with 20 ppn, memory ~20 GB
- 54.1 million x 100 base fastq reads (5.4 Gbase): cpu hrs on snyder with 20 ppn, memory ~35 GB

Torque/PBS Examples

1. Simple job file for a single query. The following output format is useful for annotation because it includes the subject sequence annotation line.

    --outfmt 6 qseqid qlen qstart qend sseqid slen sstart send length pident evalue stitle

Diamond with single query (simple)

    #!/bin/bash
    #PBS -q mgribsko
    #PBS -N diamond
    #PBS -l walltime=300:00:00
    #PBS -l nodes=1:ppn=20
    #PBS -l epilogue=/home/mgribsko/jobs/epilogue.sh
    #PBS -l naccesspolicy=shared

    cd $PBS_O_WORKDIR
    module load diamond

    echo "indexing database"
    # make indexed database, only need to do this once for each reference
    diamond makedb --in uniref90.ren.fasta --db uniref90.ren.dmnd

    echo "starting search"
    diamond blastx \
        --db uniref90.ren.dmnd \
        --threads 20 \
        --evalue 1e-05 \
        --query all_euk_from_megan.ren.fa \
        --outfmt 6 \
        --out all_euk_megan.ren.uniref90.dmnd.blastx

    # use the --outfmt option above to get BlastTab output for MEGAN6
    # for more complete output for annotation use
    # --outfmt 6 qseqid qlen qstart qend sseqid slen sstart send length pident evalue stitle

Beginning of the diamond help output:

    diamond v by Benjamin Buchfink <buchfink@gmail.com>
    Check for updates.

    Syntax: diamond COMMAND [OPTIONS]

    Commands:
    makedb     Build DIAMOND database from a FASTA file
    blastp     Align amino acid query sequences against a protein reference database
    blastx     Align DNA query sequences against a protein reference database
    view       View DIAMOND alignment archive (DAA) formatted file
    help       Produce help message
    version    Display version information
    getseq     Retrieve sequences from a DIAMOND database file

    General options:
    --threads (-p)    number of CPU threads
    --db (-d)         database file
    --out (-o)        output file
    --outfmt (-f)     output format
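The annotation format named in the script comments (`--outfmt 6 qseqid qlen qstart qend sseqid slen sstart send length pident evalue stitle`) produces 12 tab-separated columns, which are easy to post-process with awk. The following is only a sketch of one possible summary, not part of the original job file: the function name `summarize_hits`, the identity threshold, and the two demonstration lines are made up for illustration.

```shell
#!/bin/bash
# Summarize annotation-format Diamond output (12 tab-separated columns):
# qseqid qlen qstart qend sseqid slen sstart send length pident evalue stitle
# summarize_hits is a hypothetical helper, not a Diamond command.
summarize_hits() {
    local min_pident=$1 file=$2
    awk -F'\t' -v min="$min_pident" '
        $10 >= min {
            # qstart > qend for minus-strand hits, so take the absolute span
            span = ($4 > $3 ? $4 - $3 : $3 - $4) + 1
            # print query, subject, percent of query covered, identity, title
            printf "%s\t%s\t%.1f\t%s\t%s\n", $1, $5, 100 * span / $2, $10, $12
        }' "$file"
}

# demonstration on two made-up hit lines (not real search results)
demo=$(mktemp)
printf 'tx1\t300\t1\t150\tUniRef90_A\t60\t1\t50\t50\t80.0\t1e-20\thypothetical protein A\n' > "$demo"
printf 'tx2\t300\t290\t201\tUniRef90_B\t40\t1\t30\t30\t40.0\t1e-06\thypothetical protein B\n' >> "$demo"
summary=$(summarize_hits 50 "$demo")
echo "$summary"
```

Because `stitle` contains spaces, the field separator must be set to a tab explicitly; splitting on whitespace would break the title column apart.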

2. Automated script for use with MEGAN6

Diamond automated script (complex)

    #!/bin/bash
    #PBS -q mgribsko
    #PBS -N diamond
    #PBS -l walltime=8:00:00
    #PBS -l nodes=1:ppn=20
    #PBS -l epilogue=/home/mgribsko/jobs/epilogue.sh
    #PBS -l naccesspolicy=shared

    # Diamond blastx; 1000X faster than BlastX
    # query is specified in $target: fasta or fastq nucleotide sequences are OK
    # the alignment parameters are set up for use with MEGAN6
    #
    # usage
    #   qsub diamond_mult.job      run on cluster
    #   diamond_mult.job debug     print command strings and exit

    index=false              # run indexing if true
    use_personal_exe=true    # use personally installed exe below
    diamond_exe="/scratch/snyder/m/mgribsko/src/diamond"
    datadir="."              # directory where query files are found
    target="srr123*.fastq"
    reference="/scratch/snyder/m/mgribsko/src/megan/nr"    # reference database fasta
    db="/scratch/snyder/m/mgribsko/src/megan/nr"           # indexed reference database

    echo "Diamond"
    echo "target: $datadir/$target"
    echo "index: $index"
    echo "reference: $reference"
    echo "db: $db"

    # Begin script; hopefully you should have to change nothing below this

    if [ $PBS_O_WORKDIR ]; then
        # execute only if qsub
        if $use_personal_exe; then
            diamond=$diamond_exe
            echo -e "diamond exe: $diamond\n"
        else
            module load diamond
            diamond=diamond      # use the module's executable from PATH
            echo -e "diamond exe: bioinfo\n"
        fi
        cd $PBS_O_WORKDIR
    fi

    # create index; only if $index is true
    if $index; then
        echo "indexing database"
        # make indexed database
        com0="$diamond makedb \
            --in $reference \
            --db $db"
        echo -e "$com0\n"
        if [[ $1 != "debug" ]]; then
            $com0
        fi
    fi

    # Execute search for all files matching $target
    echo -e "starting search\n"
    shopt -s nullglob
    for query in $datadir/$target; do
        out="${query%\.*}.dmnd.blastx"    # replace the query extension
        out="${out##\.*/}"                # strip a leading ./ style directory prefix
        com1="$diamond blastx \
            --threads 20 \
            --db $db \
            --query $query \
            --outfmt 6 \
            --top 10 \
            --min-score 50 \
            --out $out "
        echo -e "$com1\n"
        if [[ $1 != "debug" ]]; then
            $com1
        fi
    done

3. Most recent automated

More automated - works for MEGAN or annotation

    #!/bin/bash
    #PBS -q standby
    #PBS -N diamond
    #PBS -l walltime=150:00:00
    #PBS -l nodes=1:ppn=20
    #PBS -l epilogue=/home/mgribsko/jobs/epilogue.sh
    #PBS -l naccesspolicy=shared

    # Diamond blastx; 1000X faster than BlastX
    # query is specified in $target: fasta or fastq nucleotide sequences are OK
    # output file names are <query>.<database>.<outsuffix>
    #
    # use mode="megan" to get output suitable for MEGAN6
    # megan mode assumes the database is ncbi nr and that it's located in $megandir
    # otherwise output will be my favorite for annotation
    #
    # usage
    #   qsub diamond_mult.job      run on cluster
    #   diamond_mult.job debug     print command strings and exit

    mode="megan"
    index=true                      # run indexing if true
    datadir="."                     # directory where query files are found
    target="*.fasta.reformatted"    # suffix for query files
    outsuffix=".dmnd.blastx"        # suffix for output files

    use_personal_exe=true           # use personally installed exe below
    diamond_exe="/scratch/snyder/m/mgribsko/src/diamond"

    # reference sequence database. normally the indexed db (dbidx) has the same
    # name as the fasta file (dbfasta); if you give your database a different name
    # you will have to give these symbols different values. Diamond assumes
    # the index has the suffix .dmnd
    dbfasta="/scratch/snyder/m/mgribsko/uniref_180625/diamond/180625uniref90.ren.fasta"
    dbidx=$dbfasta

    # set up for MEGAN mode, if not MEGAN mode, annotation mode is assumed
    megandir="/scratch/snyder/m/mgribsko/src/megan"    # megan reference directory
    mode_opts="--outfmt 6 qseqid qlen qstart qend sseqid slen sstart send length pident evalue stitle"
    if [[ $mode == "megan" ]] || [[ $mode == "MEGAN" ]]; then
        echo -e "\nrunning in MEGAN mode"
        dbfasta="$megandir/nr"
        dbidx=$dbfasta
        mode_opts="--outfmt 6 \
            --min-score 50 \
            --top 10"
    else
        echo -e "\nrunning in annotation mode"
    fi

    echo "Diamond"
    echo "query target: $datadir/$target"
    echo "output suffix: $outsuffix"
    echo "db index: $dbidx"
    echo "db fasta: $dbfasta"
    echo "mode: $mode"

    # Begin script; hopefully you should have to change nothing below this

    # run personal version or bioinfo version
    if $use_personal_exe; then
        diamond=$diamond_exe
        echo -e "diamond exe: $diamond\n"
    else
        diamond=diamond      # use the module's executable from PATH
        echo -e "diamond exe: bioinfo\n"
        if [ $PBS_O_WORKDIR ]; then
            module load diamond
        fi
    fi

    if [ $PBS_O_WORKDIR ]; then
        # execute only if qsub
        cd $PBS_O_WORKDIR
    fi

    # create index; only if $index is true
    # index file is normally in the same directory as the fasta file
    if $index; then
        echo -e "\nindexing database"
        # make indexed database
        com0="$diamond makedb \
            --in $dbfasta \
            --db $dbidx"
        # print one option per line
        echo -e "${com0// --/\\n\\t--}\n"
        if [[ $1 != "debug" ]]; then
            $com0
        fi
    fi

    # Execute search for all files matching $target
    echo -e "\nstarting search"
    shopt -s nullglob
    for query in $datadir/$target; do
        # base db name
        basedb=${dbidx##*/}
        basedb=${basedb%%\.*}
        out="${query##*/}"
        out="${out%%\.*}.$basedb$outsuffix"
        com1="$diamond blastx \
            --threads 20 \
            --db $dbidx \
            --query $query \
            $mode_opts \
            --out $out "
        # print one option per line
        com1out=${com1// --/\\n\\t--}
        echo -e "\n$com1out"
        if [[ $1 != "debug" ]]; then
            $com1
        fi
    done
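Two ideas in the automated scripts can be tried without a cluster or a Diamond installation: deriving the output name `<query>.<database><suffix>` with bash parameter expansion, and the `debug` argument that prints commands instead of running them. The sketch below is illustrative only. The helper names `make_out_name` and `run_or_print` and the example paths are hypothetical, and `run_or_print` passes the command as separate arguments rather than as the single string the scripts build.

```shell
#!/bin/bash
# Derive an output name of the form <query>.<database><suffix>
# using only bash parameter expansion.
make_out_name() {
    local query=$1 db=$2 suffix=$3
    local basedb=${db##*/}      # drop the directory part of the database path
    basedb=${basedb%%.*}        # drop everything from the first dot onward
    local out=${query##*/}      # drop the directory part of the query path
    out=${out%%.*}              # drop everything from the first dot onward
    echo "$out.$basedb$suffix"
}

# Print a command, and run it unless the mode is "debug".
run_or_print() {
    local mode=$1
    shift
    echo "$*"
    if [[ $mode != "debug" ]]; then
        "$@"
    fi
}

# hypothetical paths, used only for illustration
name=$(make_out_name "./reads/sample1.fasta.reformatted" \
    "/data/db/uniref90.ren.fasta" ".dmnd.blastx")
echo "$name"

# in debug mode the command is printed but never executed
shown=$(run_or_print debug diamond blastx --query sample1.fasta --out "$name")
echo "$shown"
```

Note that `%%.*` removes everything after the first dot, so query names that differ only after the first dot would map to the same output file.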

Full diamond help output:

    diamond v by Benjamin Buchfink
    Licensed under the GNU AGPL
    Check for updates.

    Syntax: diamond COMMAND [OPTIONS]

    Commands:
    makedb     Build DIAMOND database from a FASTA file
    blastp     Align amino acid query sequences against a protein reference database
    blastx     Align DNA query sequences against a protein reference database
    view       View DIAMOND alignment archive (DAA) formatted file
    help       Produce help message
    version    Display version information
    getseq     Retrieve sequences from a DIAMOND database file
    dbinfo     Print information about a DIAMOND database file

    General options:
    --threads (-p)    number of CPU threads
    --db (-d)         database file
    --out (-o)        output file
    --outfmt (-f)     output format
        0   = BLAST pairwise
        5   = BLAST XML
        6   = BLAST tabular
        100 = DIAMOND alignment archive (DAA)
        101 = SAM

        Value 6 may be followed by a space-separated list of these keywords:

        qseqid      Query Seq - id
        qlen        Query sequence length
        sseqid      Subject Seq - id
        sallseqid   All subject Seq - id(s), separated by a ';'
        slen        Subject sequence length
        qstart      Start of alignment in query
        qend        End of alignment in query
        sstart      Start of alignment in subject
        send        End of alignment in subject
        qseq        Aligned part of query sequence
        sseq        Aligned part of subject sequence
        full_sseq   Full subject sequence
        evalue      Expect value
        bitscore    Bit score
        score       Raw score
        length      Alignment length
        pident      Percentage of identical matches
        nident      Number of identical matches
        mismatch    Number of mismatches
        positive    Number of positive - scoring matches
        gapopen     Number of gap openings
        gaps        Total number of gaps
        ppos        Percentage of positive - scoring matches
        qframe      Query frame
        btop        Blast traceback operations (btop)
        staxids     Unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
        stitle      Subject Title
        salltitles  All Subject Title(s), separated by a '<>'
        qcovhsp     Query Coverage Per HSP
        qtitle      Query title

        Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    --verbose (-v)    verbose console output
    --log             enable debug log
    --quiet           disable console output

    Makedb options:
    --in              input reference file in FASTA format

    Aligner options:
    --query (-q)            input query file
    --strand                query strands to search (both/minus/plus)
    --un                    file for unaligned queries
    --unal                  report unaligned queries (0=no, 1=yes)
    --max-target-seqs (-k)  maximum number of target sequences to report alignments for
    --top                   report alignments within this percentage range of top alignment score (overrides --max-target-seqs)
    --range-culling         restrict hit culling to overlapping query ranges
    --compress              compression for output files (0=none, 1=gzip)
    --evalue (-e)           maximum e-value to report alignments (default=0.001)
    --min-score             minimum bit score to report alignments (overrides e-value setting)
    --id                    minimum identity% to report an alignment
    --query-cover           minimum query cover% to report an alignment
    --subject-cover         minimum subject cover% to report an alignment
    --sensitive             enable sensitive mode (default: fast)
    --more-sensitive        enable more sensitive mode (default: fast)
    --block-size (-b)       sequence block size in billions of letters (default=2.0)
    --index-chunks (-c)     number of chunks for index processing
    --tmpdir (-t)           directory for temporary files
    --gapopen               gap open penalty
    --gapextend             gap extension penalty
    --frameshift (-F)       frame shift penalty (default=disabled)
    --matrix                score matrix for protein alignment (default=blosum62)
    --custom-matrix         file containing custom scoring matrix
    --lambda                lambda parameter for custom matrix
    --K                     K parameter for custom matrix
    --comp-based-stats      enable composition based statistics (0/1=default)
    --masking               enable masking of low complexity regions (0/1=default)
    --query-gencode         genetic code to use to translate query (see user manual)
    --salltitles            include full subject titles in DAA file
    --sallseqid             include all subject ids in DAA file
    --no-self-hits          suppress reporting of identical self hits
    --taxonmap              protein accession to taxid mapping file
    --taxonnodes            taxonomy nodes.dmp from NCBI
    --taxonlist             restrict search to list of taxon ids (comma-separated)

    Advanced options:
    --algo                  Seed search algorithm (0=double-indexed/1=query-indexed)
    --bin                   number of query bins for seed search
    --min-orf (-l)          ignore translated sequences without an open reading frame of at least this length
    --freq-sd               number of standard deviations for ignoring frequent seeds
    --id2                   minimum number of identities for stage 1 hit
    --window (-w)           window size for local hit search
    --xdrop (-x)            xdrop for ungapped alignment
    --ungapped-score        minimum alignment score to continue local extension
    --hit-band              band for hit verification
    --hit-score             minimum score to keep a tentative alignment
    --gapped-xdrop (-X)     xdrop for gapped alignment in bits
    --band                  band for dynamic programming computation
    --shapes (-s)           number of seed shapes (0 = all available)
    --shape-mask            seed shapes
    --index-mode            index mode (0=4x12, 1=16x9)
    --rank-ratio            include subjects within this ratio of last hit (stage 1)
    --rank-ratio2           include subjects within this ratio of last hit (stage 2)
    --max-hsps              maximum number of HSPs per subject sequence to save for each query
    --range-cover           percentage of query range to be covered for hit culling (default=50)
    --dbsize                effective database size (in letters)
    --no-auto-append        disable auto appending of DAA and DMND file extensions
    --xml-blord-format      Use gnl BL_ORD_ID style format in XML output

    View options:
    --daa (-a)              DIAMOND alignment archive (DAA) file
    --forwardonly           only show alignments of forward strand

    Getseq options:
    --seq                   Sequence numbers to display.
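The default tabular columns listed in the help text (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore) put the bit score in column 12, so a single best hit per query can be picked out after the search. This is a post-processing sketch, not a Diamond option, and it assumes the default 12-column layout. The function name `best_hits` and the demonstration lines are invented for the example.

```shell
#!/bin/bash
# Keep the highest-bitscore line for each query from default --outfmt 6 output:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
# best_hits is a hypothetical helper, not a Diamond command.
best_hits() {
    awk -F'\t' '
        # remember a line if the query is new or its bit score beats the stored one
        !($1 in best) || ($12 + 0) > best[$1] { best[$1] = $12 + 0; line[$1] = $0 }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}

# demonstration on made-up hits (not real search results)
demo=$(mktemp)
printf 'q1\tsA\t45.0\t100\t55\t0\t1\t300\t1\t100\t1e-08\t55.5\n' > "$demo"
printf 'q1\tsB\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t120\n' >> "$demo"
printf 'q2\tsC\t70.0\t80\t24\t0\t4\t243\t1\t80\t1e-30\t80.0\n' >> "$demo"
best=$(best_hits "$demo")
echo "$best"
```

The final `sort` only makes the output order predictable, because awk does not guarantee any order when iterating over an array.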
