Manual of SOAPdenovo-Trans-v1.03. Yinlong Xie, Gengxiong Wu, Jingbo Tang,

Similar documents
Introduction and tutorial for SOAPdenovo. Xiaodong Fang Department of Science and BGI May, 2012

1 Abstract. 2 Introduction. 3 Requirements

De novo sequencing and Assembly. Andreas Gisel International Institute of Tropical Agriculture (IITA) Ibadan, Nigeria

Omega: an Overlap-graph de novo Assembler for Metagenomics

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

User Manual. This is the example for Oases: make color 'VELVET_DIR=/full_path_of_velvet_dir/' 'MAXKMERLENGTH=63' 'LONGSEQUENCES=1'

ABySS. Assembly By Short Sequences

Performance of Trinity RNA-seq de novo assembly on an IBM POWER8 processor-based system

Examining De Novo Transcriptome Assemblies via a Quality Assessment Pipeline

Taller práctico sobre uso, manejo y gestión de recursos genómicos de abril de 2013 Assembling long-read Transcriptomics

Sequence Analysis Pipeline

Instruction Manual for the Ray de novo genome assembler software

RNA- SeQC Documentation

Meraculous De Novo Assembly of the Ariolimax dolichophallus Genome. Charles Cole, Jake Houser, Kyle McGovern, and Jennie Richardson

Next Generation Sequencing Workshop De novo genome assembly

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Genome Assembly and De Novo RNAseq

IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Genome Assembly: Preliminary Results

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub

Gap Filling as Exact Path Length Problem

see also:

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

IDBA - A practical Iterative de Bruijn Graph De Novo Assembler

Long Read RNA-seq Mapper

Assembly of the Ariolimax dolicophallus genome with Discovar de novo. Chris Eisenhart, Robert Calef, Natasha Dudek, Gepoliano Chaves

RNA-seq. Manpreet S. Katari

NCGAS Makes Robust Transcriptome Assembly Easier with a Readily Usable Workflow Following de novo Assembly Best Practices

High-throughout sequencing and using short-read aligners. Simon Anders

IDBA A Practical Iterative de Bruijn Graph De Novo Assembler

User's Guide to DNASTAR SeqMan NGen For Windows, Macintosh and Linux

Performance analysis of parallel de novo genome assembly in shared memory system

Read Mapping. Slides by Carl Kingsford

Read Naming Format Specification

Next generation sequencing: de novo assembly. Overview

RESEARCH TOPIC IN BIOINFORMANTIC

CLC Server. End User USER MANUAL

Tutorial: De Novo Assembly of Paired Data

ABSTRACT USING MANY-CORE COMPUTING TO SPEED UP DE NOVO TRANSCRIPTOME ASSEMBLY. Sean O Brien, Master of Science, 2016

Ensembl RNASeq Practical. Overview

Practical Bioinformatics for Life Scientists. Week 4, Lecture 8. István Albert Bioinformatics Consulting Center Penn State

Introduction to Genome Assembly. Tandy Warnow

Genome 373: Genome Assembly. Doug Fowler

Short Read Sequencing Analysis Workshop

Our data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there:

David Crossman, Ph.D. UAB Heflin Center for Genomic Science. GCC2012 Wednesday, July 25, 2012

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose

Bioinformatics in next generation sequencing projects

Exeter Sequencing Service

K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

Browser Exercises - I. Alignments and Comparative genomics

De novo genome assembly

KisSplice. Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data. 29th may 2013

1. Download the data from ENA and QC it:

Genomic Finishing & Consed

Reducing Genome Assembly Complexity with Optical Maps

Preliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification

RNA-Seq data analysis software. User Guide 023UG050V0200

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

RNA-seq Data Analysis

SMALT Manual. December 9, 2010 Version 0.4.2

Supplementary Figure 1. Fast read-mapping algorithm of BrowserGenome.

Importing sequence assemblies from BAM and SAM files

BLAST & Genome assembly

Read Mapping and Assembly

NGS FASTQ file format

Tutorial. Small RNA Analysis using Illumina Data. Sample to Insight. October 5, 2016

Small RNA Analysis using Illumina Data

Tutorial. Aligning contigs manually using the Genome Finishing. Sample to Insight. February 6, 2019

Mapping Reads to Reference Genome

Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software:

CS 68: BIOINFORMATICS. Prof. Sara Mathieson Swarthmore College Spring 2018

Single/paired-end RNAseq analysis with Galaxy

ChIP-seq hands-on practical using Galaxy

Hybrid Parallel Programming

Exercise 2: Browser-Based Annotation and RNA-Seq Data

Read Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015

ChIP-seq hands-on practical using Galaxy

Services Performed. The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples.

RNA-Seq data analysis software. User Guide 023UG050V0210

m6aviewer Version Documentation

Illumina Next Generation Sequencing Data analysis

7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB

RASER: Reads Aligner for SNPs and Editing sites of RNA (version 0.51) Manual

Reducing Genome Assembly Complexity with Optical Maps

de Bruijn graphs for sequencing data

User's guide to ChIP-Seq applications: command-line usage and option summary

Tutorial for Windows and Macintosh. De Novo Sequence Assembly with Velvet

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Reducing Genome Assembly Complexity with Optical Maps Mid-year Progress Report

Tutorial: RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and Expression measures

MinHash Alignment Process (MHAP) Documentation

NGI-RNAseq. Processing RNA-seq data at the National Genomics Infrastructure. NGI stockholm

Tiling Assembly for Annotation-independent Novel Gene Discovery

Data Preprocessing. Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

Data Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis

Tutorial: RNA-Seq analysis part I: Getting started

Sequence mapping and assembly. Alistair Ward - Boston College

Transcription:

Manual of SOAPdenovo-Trans-v1.03 Yinlong Xie, 2013-07-19 Gengxiong Wu, 2013-07-19 Jingbo Tang, 2013-07-19 ********** Introduction SOAPdenovo-Trans is a de novo transcriptome assembler basing on the SOAPdenovo framework, adapt to alternative splicing and different expression level among transcripts.the assembler provides a more accurate, complete and faster way to construct the full-length transcript sets. System Requirement SOAPdenovo-Trans aims for the transcript assembly. It runs on 64-bit Linux systems. For animal transcriptomes like mouse, about 30-35GB memory would be required. Update Log 1.03 2013-07-19 12:00:00 +0800 (Fri, 19 Jul 2013) Add the function: calculate RPKM (Reads per Kilobase of assembled transcripts per Million mapped reads). Installation 1. You can download the pre-compiled binary according to your platform, unpack and execute directly. 2. Or download the source code, unpack to ${destination folder} with the method above, and compile by using GNU make with command "sh make.sh" at ${destination folder} and generate the executable files "SOAPdenovo-Trans-31mer" and "SOAPdenovo-Trans-127mer". How to use it 1. Configuration file The configuration file in SOAPdenovo-Trans is mostly the same as SOAPdenovo, but there is no "rank" parameter. The configuration file tells the assembler where to find these files and the relevant information. "example.config" demonstrates how to organize the information and make configuration file. The configuration file has a section for global information, and then multiple library sections. Right now only "max_rd_len" is included in the global information section. Any read longer than max_rd_len will be cut to this length. The library information and the information of sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with tag [LIB] and includes the following items: 1) avg_ins This value indicates the average insert size of this library or the peak value position in the insert size distribution figure.

2) reverse_seq This option takes value 0 or 1. It tells the assembler if the read sequences need to be complementarily reversed. Illumima GA produces two types of paired-end libraries: a) forwardreverse, generated from fragmented DNA ends with typical insert size less than 500 bp; b) reverseforward, generated from circularizing libraries with typical insert size greater than 2 Kb. The parameter "reverse_seq" should be set to indicate this: 0, forward-reverse; 1, reverse-forward. 3) asm_flags This indicator decides in which part(s) the reads are used. It takes value 1(only contig assembly), 2 (only scaffold assembly), 3(both contig and scaffold assembly). 4) rd_len_cutof The assembler will cut the reads from the current library to this length. 5) map_len This takes effect in the "map" step and is the mininum alignment length between a read and a contig required for a reliable read location. The minimum length for paired-end reads and mate-pair reads is 32 and 35 respectively. The assembler accepts read file in three kinds of formats: FASTA, FASTQ and BAM. Mate-pair relationship could be indicated in two ways: two sequence files with reads in the same order belonging to a pair, or two adjacent reads in a single file (FASTA only) are belonging to a pair. In the configuration file single end files are indicated by "f=/path/filename" or "q=/path/filename" for fasta or fastq formats separately. Paired reads in two fasta sequence files are indicated by "f1=" and "f2=". While paired reads in two fastq sequences files are indicated by "q1=" and "q2=". Paired reads in a single fasta sequence file is indicated by "p=" item. Reads in bam sequence files is indicated by "b=". All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file. 2. Get it started Once the configuration file is available, the simplest way to run the assembler is:./soapdenovo-trans all -s config_file -o output_prefix User can also choose to run the assembly process step by step as:./soapdenovo-trans pregraph -s config_file -K 31 -o outputgraph./soapdenovo-trans contig -g outputgraph./soapdenovo-trans map -s config_file -g outputgraph./soapdenovo-trans scaff -g outputgraph -F NOTE: SOAPdenovo-Trans has two versions: SOAPdenovo-Trans-31mer and SOAPdenovo-Trans- 127mer. 3. Options: SOAPdenovo-Trans all -s configfile -o outputgraph [-R -f -S -F] [-K kmer -p n_cpu -d kmerfreqcutoff -e EdgeCovCutoff -M mergelevel -L mincontiglen -t locusmaxoutput -G gaplendiff] -s <string> configfile: the config file of reads -o <string> outputgraph: prefix of output graph file name -g <string> inputgraph: prefix of input graph file names -R (optional) output assembly RPKM statistics, [NO] -f (optional) output gap related reads for SRkgf to fill gap, [NO] -S (optional) scaffold structure exists, [NO] -F (optional) fill gaps in scaffolds, [NO] -K <int> kmer (min 13, max 31/127): kmer size, [23] -p <int> n_cpu: number of cpu for use, [8]

-d <int> kmerfreqcutoff: kmers with frequency no larger than KmerFreqCutoff will be deleted, [0] -e <int> EdgeCovCutoff: edges with coverage no larger than EdgeCovCutoff will be deleted, [2] -M <int> mergelevel (min 0, max 3): the strength of merging similar sequences during contiging, [1] -L <int> mincontiglen: shortest contig for scaffolding, [100] -t <int> locusmaxoutput: output the number of transcripts no more than locusmaxoutput in one locus, [5] -G <int> gaplendiff: allowed length difference between estimated and filled gap, [50] 4. Output files These files are output as assembly results: *.contig contig sequence file *.scafseq scaffold sequence file There are some other files that provide useful information for advanced users, which are listed in Appendix B. 5. Parameter adjustment -K: The kmer value is always depended on data size and its transcript features. At the current stage, SOAPdenovo-Trans has two versions: 1. SOAPdenovo-Trans-31mer, which accepts odd Kmer value from 13 to 31, 2. SOAPdenovo-Trans-127mer, which accepts odd Kmer value from 13 to 127. Ordinarily, SOAPdenovo-Trans always assembles the RNA-seq data by small kmer (~35-mer) as some of the transcripts are in low expression level. -M: The parameter is used to adjust the strength of merging similar sequences (bubble structure in the de Bruijn Graph). The similar sequences may be caused by sequencing error, repeat sequence, paralogs and heterozygosis. -L: The parameter is sensitive to the accuracy and coverage rate of the assemble result. The larger value we set, the higher the accuracy and lower coverage result we get. For example in the Mouse data assembly, the accuracy and coverage rate are 56.05% and 99.03% respectively when "-L" is set to 50. The accuracy and coverage rate are 88.20% and 98.22% respectively when the value is 150. The accuracy is increased by 32.15%. However, the coverage rate is decreased by only 0.79%. -e: The parameter is always 1~3, default 2. It deletes the low-coverage edge. -t: In the software, sub-graph may be a group of transcripts. The parameter is to set the upper limit of the transcripts output from each sub-graph. 5 is the experiential value. If the complexity of transcriptome is high, increase it. If the user needs higher true positive, decrease it. -F: The parameter is optimal. It is set to utilize the gap filling step in SOAPdenovo-Trans. The user can use other gap-filling softwares instead. -R:

The parameter is optional. It is set to calculate RPKM (Reads per Kilobase of assembled transcripts per Million mapped reads). Note that the RPKM is calculated by unique reads base on Kmer mapping. The information could be found from the file "*.RPKM.Stat". Furthermore, the function will output two intermediate files: *. readinformation and *.scafstatistics. They record the information about the reads' locations on contigs and scaffolds. -S: The parameter is optional. It is set to skip the step of constructing the scaffold and fill gap directly. It is used to try new methods of gap filling for developers. Or, if users have assembled the data with (or without) gap filling, they can get the assemblies rapidly without (or with) gap filling on the premise of keeping all the output files after scaffolding. APPENDIX A: example.config #maximal read length max_rd_len=50 [LIB] #maximal read length in this lib rd_len_cutof=45 #average insert size avg_ins=200 #if sequence needs to be reversed reverse_seq=0 #in which part(s) the reads are used asm_flags=3 #minimum aligned length to contigs for a reliable read location (at least 32 for short insert size) map_len=32 #fastq file for read 1 q1=/path/**libnamea**/fastq_read_1.fq #fastq file for read 2 always follows fastq file for read 1 q2=/path/**libnamea**/fastq_read_2.fq #fasta file for read 1 f1=/path/**libnamea**/fasta_read_1.fa #fastq file for read 2 always follows fastq file for read 1 f2=/path/**libnamea**/fasta_read_2.fa #fastq file for single reads q=/path/**libnamea**/fastq_read_single.fq #fasta file for single reads f=/path/**libnamea**/fasta_read_single.fa #a single fasta file for paired reads p=/path/**libnamea**/pairs_in_one_file.fa APPENDIXA B: 1. Output files from the command "pregraph" a. *.kmerfreq Each row shows the number of Kmers with a frequency equals the row number. Note that those peaks of frequencies which are the integral multiple of 63 are due to the data structure. b. *.edge Each record gives the information of an edge in the pre-graph: length, Kmers on both ends,

average kmer coverage, whether it's reverse-complementarily identical and the sequence. c. *.prearc Connections between edges which are established by the read paths. d. *.vertex Kmers at the ends of edges. e. *.pregraphbasic Some basic information about the pre-graph: number of vertex, K value, number of edges, maximum read length etc. 2. Output files from the command "contig" a. *.contig Contig information: corresponding edge index, length, kmer coverage, whether it's tip and the sequence. Either a contig or its reverse complementary counterpart is included. Each reverse complementary contig index is indicated in the *.ContigIndex file. b. *.Arc Arcs coming out of each edge and their corresponding coverage by reads. c. *.updated.edge Some information for each edge in graph: length, Kmers at both ends, index difference between the reverse-complementary edge and this one. d. *.ContigIndex Each record gives information about each contig in the *.contig: it's edge index, length, the index difference between its reverse-complementary counterpart and itself. 3. Output files from the command "map" a. *.pegrads Information for each clone library: insert-size, read index upper bound, rank and pair number cutoff for a reliable link. This file can be revised manually for scaffolding tuning. b. *.readoncontig Reads' locations on contigs. Here contigs are referred by their edge index. However about half of them are not listed in the *.contig file for their reverse-complementary counterparts are included already. c. *.readingap This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds if "-F" is set. d. *. readinformation Reads' locations on contigs: read id, start position of read, contig id, start position of contig, the align length and orientation. 4. Output files from the command "scaff" a. *.newcontigindex Contigs are sorted according their length before scaffolding. Their new indexes are listed in this file. This is useful if one wants to corresponds contigs in *.contig with those in *.links. b. *.links Links between contigs which are established by read pairs. New index are used. c. *.scaf_gap Contigs in gaps are found by contig graph outputted by the contiging procedure. Here new index are used. d. *.scaf Contigs for each scaffold: contig index (concordant to index in *.contig), approximate start position on scaffold, orientation, contig length, and its links to others contigs. e. *.gapseq Gap sequences between contigs.

f. *.scafseq Sequences of each scaffolds. g. *.contigposinscaff Contigs' positions in each scaffold. h. *.readonscaf Reads' locations on each scaffold: read id, start position of read, start position of scaffold, orientation and align length in order. i. *.scafstatistics Statistic information of final scaffold and contig. j. *.RPKM.Stat RPKM information: transcript ID, Transcript Length, Unique reads number, RPKM value.