

SATRAP v0.1 - Solid Assembly TRAnslation Program
User Manual

Introduction

A color space assembly must be translated into bases before bioinformatics analyses can be applied to it. SATRAP is designed to accomplish this task with a very efficient strategy. The package integrates the Oases pipeline with several optimizations specifically designed for color space management. Running all steps of the pipeline produces a SOLiD de novo transcriptome assembly followed by its color space translation. Alternatively, SATRAP can be used as a stand-alone program to perform the color space translation of either RNA-seq or DNA-seq SOLiD assemblies.

Installing

The programs are written in C++ and are supported only on UNIX systems. If you want to install them on Windows, you may try the GCC cross compiler (not tested). To compile the binaries, open a shell and type the command "make". All binaries will be compiled into the "bin" directory. On 32-bit CPU systems, type "make -f Makefile32" instead.

Please see the respective manuals to install Oases and Velvet. Remember that Oases and Velvet must be compiled using the color option. This is the example for Oases:

    make color 'VELVET_DIR=/full_path_of_velvet_dir/' 'MAXKMERLENGTH=63' 'LONGSEQUENCES=1'

and this is the one for Velvet:

    make color 'MAXKMERLENGTH=63' 'LONGSEQUENCES=1'

Important requirements

SOLiD DNA-seq or RNA-seq reads must first be converted into CSFASTQ format. This can easily be done using the program csfasta_to_fastq included in the SATRAP package. Alternatively, you can download the "CONVERSION TOOLS", which allow you to manipulate read files and obtain the CSFASTQ format.

Easy setting for impatient users

Preparing your data

(1) Create the directory that will contain the SOLiD RNA-seq data. For instance:

    mkdir MY_DATASET

(2) If you have native SOLiD data, convert them into csfastq format. For instance:

    ./csfasta_fastq -csfasta reads_file.fasta -qual quality_file.qual > MY_DATASET/New_name.csfastq

Repeat the same operation for all files.

Important notice for paired-end assembling

If you run only the translation of a SOLiD assembly (Step 3b below), the names of the two paired csfastq files must be identical except for the last character before the suffix, for instance:

    brain-replicaa-1.csfastq and brain-replicaa-2.csfastq

This makes it possible to discriminate the files associated with the different sequenced ends. Note that the differing character must be at the end of the file name and that the suffix remains unmodified. Please check the logs of steps 1 and 3 to make sure that the files are correctly associated.

SATRAP execution

(3a) Execute the entire analysis. For paired-end sequenced libraries, make sure that the read tag names are those indicated in the -tags parameter. Typically, for SOLiD RNA-seq reads they are _F3 and _F5-RNA. For instance, you can run:

    bin/satrap -step 1 2 3 4 -reads_path DATASET/ -file_esten .csfastq \
        -tags _F3 _F5-RNA -velvet_path bin/velvet_ -oases_path \
        bin/oases_0.2.8/ -q 18 -t1 5 -t2 0

    -step 1 2 3 4   Enables all steps of the analysis
    -reads_path     The directory containing the csfastq files
    -file_esten     The extension of the csfastq files (.csfastq)
    -tags           Sets the 2 tags of the sequenced ends
    -velvet_path    The path of the installed Velvet program
    -oases_path     The path of the installed Oases program
    -q              Parameter of Step 1: reads quality threshold
    -t1             Parameter of Step 1: trims the reads of the first file at the 3' end
    -t2             Parameter of Step 1: trims the reads of the second file at the 3' end
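The paired-end naming rule above can be checked before launching SATRAP. The following is a minimal Python sketch, not part of the SATRAP package: the helper name pair_csfastq is hypothetical, and it assumes that two files form a pair when their names, once the suffix is removed, differ only in the last character.

```python
from collections import defaultdict


def pair_csfastq(names, ext=".csfastq"):
    """Group file names into pairs that differ only in the last character
    before the suffix. Returns (pairs, unpaired)."""
    if not ext:
        raise ValueError("ext must not be empty")
    groups = defaultdict(list)
    for name in sorted(names):
        if not name.endswith(ext):
            continue  # ignore files with another extension
        stem = name[: -len(ext)]
        groups[stem[:-1]].append(name)  # key = stem without its last character
    pairs, unpaired = [], []
    for key in sorted(groups):
        members = groups[key]
        if len(members) == 2:
            pairs.append((members[0], members[1]))
        else:
            unpaired.extend(members)
    return pairs, unpaired


if __name__ == "__main__":
    files = [
        "brain-replicaa-1.csfastq",
        "brain-replicaa-2.csfastq",
        "brain-replicab-1.csfastq",
        "brain-replicab-2.csfastq",
    ]
    for first, second in pair_csfastq(files)[0]:
        print(first, "<->", second)
```

Any file reported as unpaired deserves a look before the run, since the SATRAP logs of steps 1 and 3 remain the authoritative record of how files were associated.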

If the reads are not paired-end, do not specify the -tags parameter.

(3b) Execute only the translation of a SOLiD assembly. You can run the following command for an RNA assembly:

    bin/satrap -step 3 4 -reads_path DATASET/ -file_esten .csfastq \
        -fasta transcripts.fa -Q 9 -T2 5

    -fasta        Parameter of Steps 3 and 4: a SOLiD color space assembly
    -step 3 4     Enables steps 3 and 4 of the analysis
    -reads_path   The directory containing the csfastq files
    -file_esten   The extension of the csfastq files (.csfastq)
    -Q            Parameter of Step 3: reads quality threshold
    -T2           Parameter of Step 3: trims the reads at the 3' end

Note that the -step parameter is set to enable both steps 3 and 4, and that the reads are treated as non paired-end because the -tags parameter is omitted. Furthermore, the -fasta parameter indicates that the user intends to translate their own color space assembly, generated with another pipeline. For a SOLiD DNA-seq assembly the procedure is the same, but the option -no_clustering should be specified.

Detailed general explanation

Memory requirements

The memory required for the color space translation is approximately 10 times the length of the color space assembly. However, depending on the settings, both the assembly and the mapper programs could require more RAM.

Parameter setting

The entire pipeline includes four consecutive steps:

(1) Making double encoded reads for assembling
(2) Oases transcriptome assembly
(3) Double encoding for translation
(4) Color space translation

If the SOLiD raw data must be processed to produce a de novo transcriptome assembly, all four steps are required (parameter -step 1 2 3 4). Alternatively, if a color space assembly is already available, only steps 3 and 4 are required (parameter -step 3 4).
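As background for the step names above, here is a minimal Python sketch of double encoding and of the basic color-to-base translation. It uses the standard SOLiD two-base code (each color is the XOR of the 2-bit codes of two adjacent bases, with A=0, C=1, G=2, T=3) and the usual double-encoding convention (colors 0, 1, 2, 3 written as A, C, G, T). The function names are hypothetical; the sketch only illustrates the principle and is not SATRAP's implementation.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def double_encode(colors):
    """Write a color string (digits 0-3) with the letters A, C, G, T."""
    return "".join(BASES[int(c)] for c in colors)


def encode_bases(seq):
    """Colors of a base sequence: XOR of the codes of adjacent bases."""
    return "".join(
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:])
    )


def translate_colors(first_base, colors):
    """Decode colors into bases, starting from a known first base.
    The returned string does not include first_base itself."""
    code = BASES.index(first_base)
    bases = []
    for c in colors:
        code ^= int(c)  # next base code = previous base code XOR color
        bases.append(BASES[code])
    return "".join(bases)


if __name__ == "__main__":
    print(double_encode("0123"))
    print(translate_colors("T", encode_bases("TTGAT")))
```

Because each decoded base depends on all previous colors, a single wrong color corrupts every base after it, which is why a plain decode like this one is not sufficient for a real assembly.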

In any case you need to create a directory for the csfastq files or for symbolic links to them, and then define its path with the "-reads_path" parameter. If your data are in the native color space format, you can use the program "csfasta_to_fastq" included in this package to convert them into csfastq format. Example of conversion:

    csfasta_fastq -csfasta reads_file -qual quality_file > DIR/reads.csfastq

The names of the csfastq files must be chosen carefully, especially if you want to assemble SOLiD paired-end data. For instance, the files of sample 1 could be named:

    brain-replicaa-1.csfastq and brain-replicaa-2.csfastq

so that the two sequenced-end files differ only by the characters 1 and 2 in their names (note that this differing character must be at the end of the file name, while the suffix remains unmodified). In the same way, a second replicate could be named:

    brain-replicab-1.csfastq and brain-replicab-2.csfastq

and so on. This allows the alphanumerical sorting of the paired-end reads. Check the logs of steps 1 and 3 to verify that the files were correctly associated. If the reads are not paired-end, there are no restrictions on the file names.

Parameters for general setting

    -step (vector<int>)    Set the steps to be performed
    -bin (string)          Set the directory path where binaries are located [bin/]
    -tmp_dir (string)      Set the temporary directory where results will be saved [tmp/]
    -file_exten (string)   Set the extension of input read files. Example: "-file_esten .csfastq"
    -reads_path (string)   Set the directory containing the SOLiD reads in CSFASTQ format

Step 1 Making double encoded reads for assembling

(1) Skip this step if you already have a double encoded assembly to be translated. The only purpose of this step is to produce the double encoded reads to be assembled with an assembly pipeline (for instance Oases). Depending on the hardware, users can tune the number of reads to be assembled, also taking quality and trimming into account.
The file 1.de, saved in the STEP1 directory (inside the temporary directory), will contain the double encoded reads. This file is not used for translation (see Step 3).

(2) You need to create the directory containing the csfastq files or symbolic links to these files, and then pass the directory path with the "-reads_path" parameter.

(3) You need to indicate the extension of the csfastq files with the "-file_esten" parameter, usually -file_esten csfastq

Parameters for Step 1

    (*) -reads_path (string)   Directory containing the SOLiD reads in CSFASTQ format

    (*) -file_esten (string)   Extension of read files. Example: "-file_esten .csfastq"
    -max_reads (float)         Max number of reads per analyzed file or pair of files [10]
    -tags (string,string)      Paired-end tag names for assembling purposes. It enables paired-end management (-t1) (tag examples: F3, F5-RNA...)
    -t1 (int)                  Trims the first sequenced end at 3' (if paired-end) [0]
    -t2 (int)                  Trims the second sequenced end at 3' [0]
    -q (int)                   Minimum mean quality tolerated for paired-end sequences [15]
    -len (int)                 Minimum read size after trimming [30]
    -mate-pair                 Sequences coming from mate-pair libraries will be managed as paired-end (for assembling purposes) [disabled]

Important notice: for paired-end libraries the trimming function needs both parameters -t1 and -t2 to be set. Non paired-end libraries require only the -t2 parameter, so the same trimming will be applied to all files.

(*) required input.

Step 2 Oases pipeline processing

(1) This step executes the Oases pipeline. The bin/config file contains the basic settings for Velvet and Oases. You can edit this file to change the settings, but some parameters and the output will be managed automatically by the pipeline. Before running this step you must set the paths of the Velvet and Oases binaries using the -velvet_path and -oases_path parameters. Remember that Oases and Velvet must be compiled using the color option. This is the example for Oases:

    make color 'VELVET_DIR=/full_path_of_velvet_dir/' 'MAXKMERLENGTH=63' 'LONGSEQUENCES=1'

and this is the one for Velvet:

    make color 'MAXKMERLENGTH=63' 'LONGSEQUENCES=1'

Parameters for Step 2

    -velvet_path (string)     Path to Velvet binaries. Example: path/velvet/
    -oases_path (string)      Path to Oases binary. Example: path/oases/
    -strand_specific          Velvet will be run in strand-specific mode
    -kmer_set (vector<int>)   Set the kmers to be considered [ ]
    -oases_kmer (int)         Oases kmer parameter [27]

Step 3 Double encoding for translation

(1) First you need to create the directory containing the csfastq files or symbolic links to these files, and then pass the directory path to the "-reads_path" parameter. If your data are in the native color space format, you can use the program "csfasta_to_fastq" included in this package to convert them into csfastq format. Example of conversion:

    csfasta_fastq -csfasta reads_file -qual quality_file > DIR/reads.csfastq

(2) You need to indicate the extension of the csfastq files with the "-file_esten" parameter.
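The trimming, quality and length parameters used by the double-encoding steps (-t2, -q and -len in Step 1) can be pictured with the following Python sketch. It is an assumption about how such a filter may work, not SATRAP's code: the function name, the order of operations (trim first, then test length and mean quality) and the use of the trimmed qualities for the mean are all hypothetical. The defaults mirror the documented Step 1 defaults.

```python
def filter_read(colors, quals, trim3=0, min_mean_q=15, min_len=30):
    """Trim a read at its 3' end, then apply length and quality thresholds.

    colors: string of color calls; quals: list of integer qualities.
    Returns the trimmed (colors, quals) pair, or None if the read is rejected.
    """
    if len(colors) != len(quals):
        raise ValueError("colors and qualities must have the same length")
    if trim3 > 0:
        colors, quals = colors[:-trim3], quals[:-trim3]
    if not colors or len(colors) < min_len:
        return None  # too short after trimming
    if sum(quals) / len(quals) < min_mean_q:
        return None  # mean quality below the threshold
    return colors, quals


if __name__ == "__main__":
    read = "0123" * 10  # 40 colors
    kept = filter_read(read, [20] * 40, trim3=5)
    print(len(kept[0]) if kept else "rejected")
```

Under these assumptions, a stronger trim removes low-quality 3' ends but can push reads below the minimum length, so the two thresholds are best tuned together.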

Parameters for Step 3

    (*) -reads_path (string)   Directory containing the SOLiD reads in CSFASTQ format
    (*) -file_esten (string)   Extension of read files. Example: "-file_esten .csfastq"
    -T2 (int)                  Trims sequences at the 3' end [0]
    -Q (int)                   Minimum mean quality for reads [9]
    -len (int)                 Minimum read size after trimming [30]

Important note: if the -fasta parameter is not specified, steps 1 and 2 are required.

(*) required input.

Step 4 Color space translation

(1) This step executes the color space translation and requires two main inputs: the output of Step 3 and the file path of the color space assembly in FASTA format. The latter can be set using the -fasta parameter.

Parameters for Step 4

    (*) -fasta (string)   Double encoded color space assembly in FASTA format
    -l (int)              Minimum contig length [100]
    -n (float)            Maximum tolerated fraction of Ns for each translated contig [1]
    -c (int)              Minimum coverage required to apply the assembly correction. If this parameter is used, -z will not be considered.
    -erode (int)          Minimum coverage considered to erode contig ends [2]
    -z (float)            z-score used to calculate the coverage threshold, based on the statistical analysis of the sequence coverage [3]. Low values are more conservative when the error correction is applied. As a consequence, Ns will be introduced around color incoherences not supported by enough sequence coverage.
    -erosion              Does not erode contig ends in any way

(*) required input if steps 1 and 2 are not executed.

Output

A temporary directory will be created in the directory where the pipeline is running. All results will be saved in the "-tmp_dir" path. Inside this directory the STEP* directories will be created, and the translated assembly will be saved into the file STEP4/translated.fa. The files STEP4/clusters.* are the output of the transcript clustering process.
The file STEP4/translated_clustered_transcrips.fa is the final output, and the file STEP4/STEP4.log contains some statistics about the color space translation.

CONFIG file description

The settings of each program can be tuned by modifying the bin/config file. Important notice: the row order follows the order of execution! If you erase a row, the associated command will not be executed. We strongly suggest modifying only field number 4.

Field meaning

    Field 1: Referred analysis step
    Field 2: Referred loop or sub-step inside the analysis
    Field 3: Binary name to be executed
    Field 4: Base setting that does not need changes during the analysis or loops

Default configuration

    # Multi-kmrs loop
    STEP2 1 velveth_de
    STEP2 1 velvetg_de -read_trkg yes -min_contig_lgth 100
    STEP2 1 oases_de -min_trans_lgth 100
    # Merging kmer assembly
    STEP2 2 velveth_de
    STEP2 2 velvetg_de -read_trkg yes -conservelong yes -min_contig_lgth 100
    STEP2 2 oases_de -merge yes -min_trans_lgth 100
    # color space translation
    STEP4 3 pass -double_encoded -fid 90 -sam -query_size 300 -b -g 3 -pst_word_range 6 6
    STEP4 3 cs2bs_assembly -clean 30
    # Remove the following rows for DNA-seq translation
    STEP4 3 cd-hit-est -T 0 -M g 1
    STEP4 3 fasta_remove -l 50 -f 0.1 -oases
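The four-field layout described above can be read with a few lines of Python, for instance to review which commands a config file will run and in which order. This is only a sketch, under the assumption that fields are separated by whitespace and that lines starting with # are comments; the function name parse_config is hypothetical and the sample rows are adapted from the default configuration.

```python
SAMPLE = """\
# Multi-kmrs loop
STEP2 1 velveth_de
STEP2 1 velvetg_de -read_trkg yes -min_contig_lgth 100
STEP2 1 oases_de -min_trans_lgth 100
# color space translation
STEP4 3 cs2bs_assembly -clean 30
"""


def parse_config(text):
    """Return the rows as (step, loop, binary, settings) tuples, in file order."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 3)  # field 4 keeps its internal spaces
        if len(parts) < 3:
            raise ValueError(f"expected at least 3 fields: {line!r}")
        step, loop, binary = parts[:3]
        settings = parts[3] if len(parts) > 3 else ""
        rows.append((step, int(loop), binary, settings))
    return rows


if __name__ == "__main__":
    for step, loop, binary, settings in parse_config(SAMPLE):
        print(step, loop, binary, settings or "(no base settings)")
```

Keeping the rows in a list rather than a dictionary preserves the row order, which matters because the rows are executed in the order in which they appear.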
