Introduction and tutorial for SOAPdenovo. Xiaodong Fang Department of Science and BGI May, 2012




2 Why de novo assembly?
- The genome is the genetic basis for different phenotypes.
- Obtaining a reference genome is the first and necessary step toward studying an organism genome-wide in more detail.
- De novo assembly is the process of constructing a reference genome sequence for a newly sequenced organism.
- Identify genes and pathways that are difficult to study biochemically.
- Study every gene in the pathway of interest. Of course, this depends upon figuring out which genes are involved in a given pathway.
- Study non-coding regions of the genome: introns, promoters, telomeres, etc. We are probably not yet aware of all regulatory and structural features found in genomes.
- Provide large databases that are amenable to statistical methods.
- Identify variant sequences that may have subtle phenotypes.
- Study the evolution of the organism and its genome.

3 Evolution of sequencing technology
Sequencing technology | Representative sequencing instrument | Time to market | Read length (bp)
The first generation | AB | |
The second generation | Illumina GA | |
The third generation | PacBio/Nanopore | 2011/? | 10K/100K
- NGS: next generation sequencing, or "now generation" sequencing.
- Platforms: 454, Illumina, SOLiD.
- High throughput, cost-effective, short read length (100 bp for Illumina).
- SOAPdenovo was originally designed for Illumina data.

4 What is genome assembly?
Sequence assembly refers to aligning and merging fragments into a much longer DNA sequence in order to reconstruct the original sequence.
- Overlap (contig): Ge + en + no + om + mi + ic + cs -> Genomics
- Paired-end (scaffold): the read pair nom / sem links Genome****assembly, giving Genome assembly
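The overlap example on this slide can be reproduced with a few lines of Python. This is only a toy sketch: the function name `merge_by_overlap` and the fixed one-character overlap are assumptions made for illustration, and real assemblers must also choose the order of fragments and tolerate errors.

```python
def merge_by_overlap(fragments, overlap=1):
    """Chain fragments that are already in order, merging each one
    where its prefix matches the suffix of the growing contig."""
    contig = fragments[0]
    for frag in fragments[1:]:
        # The last `overlap` characters of the contig must equal
        # the first `overlap` characters of the next fragment.
        if contig[-overlap:] != frag[:overlap]:
            raise ValueError(f"no overlap between {contig!r} and {frag!r}")
        contig += frag[overlap:]
    return contig


# Hypothetical usage with the fragments from the slide.
print(merge_by_overlap(["Ge", "en", "no", "om", "mi", "ic", "cs"]))
```

Each fragment contributes only its non-overlapping tail, which is why seven two-letter fragments collapse into one eight-letter word.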


5 Two strategies for sequencing and assembly
- BAC-by-BAC: sequence and assemble each BAC independently, then merge and remove redundancy to get the reference genome sequence.
- Whole-genome shotgun: randomly break the chromosomal DNA into fragments, then sequence and assemble them all at once.

BAC-by-BAC:
- Complex, time-consuming and labor-intensive
- Low computational complexity
- High cost and high quality
- Rarely used

Whole-genome shotgun:
- Easy and fast at the experimental step
- Difficult at the computational step
- Cost-effective
- Widely used

6 Algorithms for de novo assembly
Greedy method (SSAKE, SHARCGS, VCAKE)
- Start with the given reads or contigs and repeat the basic operation until no more operations are possible.
- Each operation uses the next highest-scoring overlap to make the next join.
Overlap-Layout-Consensus (Phrap, Newbler; popular for long reads)
1. Overlap discovery involves all-against-all, pairwise read comparison.
2. An approximate read layout is constructed from the pairwise alignments.
3. Multiple sequence alignment determines the precise layout and the consensus.
De Bruijn graph (popular for Illumina)
- All sequencing reads are split into subsequences of a fixed length (K-mers; K often ranges from 21 to 127 bp).
- The links between neighboring K-mers are derived from the read sequences, so no pairwise read alignment is needed.
- Redundancy in the data is compressed automatically.
Jason R. Miller et al., Assembly algorithms for next generation sequencing data. Genomics.

7 Algorithms for de novo assembly
[Figure: example sequences GACCTACAAGTTAG and TACAAGTCCG; panels labeled "Long reads" and "Short reads"]

8 Challenges for assembly using short reads
Complexity of the genome
- Repeat sequences
- Heterozygous diploid genomes
- Polyploidy
Data characteristics of Illumina reads
- Sequencing errors (Illumina error rate ~1%)
- Short read length (~100 bp)
- High sequencing depth (~100X)
- Libraries with a wide range of insert sizes (200 bp to 40 Kbp)
Complexity of computation

9 Introduction of SOAPdenovo
- It is a novel short-read assembler designed for very large genomes.
- It employs the de Bruijn graph algorithm.
- It was the first assembler to assemble a mammalian genome using short reads.
- It has assembled hundreds of animal and plant genomes.
- It is publicly available.
Li R, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research (2010).

10 Published genomes using SOAPdenovo
- YH genome studies (NBT)
- Panda genome (Nature)
- Ant genome (Science)
- Chinese Hamster Ovary (CHO)-K1 cell line (NBT)
- Macaque genome (NBT)
- Naked mole rat genome (Nature)
- Cucumber genome (NG)
- Potato genome (Nature)
- Parasite genomes (Nature & NG)
- Brassica rapa genome (NG)
- 2011 Pigeonpea (NBT)

11 SOAPdenovo pipeline
Contiging
- K-mer graph construction
- Graph simplification: tip removal, bubble merging, resolving tiny repeats
Scaffolding
- Reads mapped to contigs
- Scaffolding iteratively from short to long insert paired-ends (PEs)
Gap filling

12 Kmer-graph construction
- K-mers are the nodes of the graph and are generated from the reads.
- Neighboring K-mers overlap by K-1 bases, as determined from the read sequences, so no pairwise read alignment is needed.
- Repeat sequences are compressed in the graph.
Reads: AGATCTTGTTATT, GTTATTGATCTCC
K-mers (K = 5): ATCTT TCTTG CTTGT TTGTT TGTTA AGATC GATCT GTTAT TGATC TTGAT ATTGA TATTG TTATT ATCTC TCTCC
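The construction on this slide can be sketched in Python using the two example reads. This is a simplified illustration, not SOAPdenovo's implementation: the function name `kmer_graph` is hypothetical, and the sketch ignores reverse complements, coverage counts and error handling.

```python
from collections import defaultdict


def kmer_graph(reads, k):
    """Map each K-mer (node) to the set of K-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # make sure every K-mer exists as a node
        # Consecutive K-mers in a read overlap by K-1 bases: that is an edge.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


reads = ["AGATCTTGTTATT", "GTTATTGATCTCC"]
graph = kmer_graph(reads, k=5)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The node GATCT ends up with two successors (ATCTT and ATCTC) because both reads pass through it, which is how shared sequence is compressed into a single node.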

13 Contig building by kmer-graph
- Tips and bubbles: sequencing errors, heterozygosity, or repeats with high sequence similarity produce tips and bubbles in the graph.
- Tiny repeats: repeats are compressed in the graph and act as shared edges for different paths, but they can be resolved by reads that span them.
- After resolving tips, bubbles and tiny repeats, we obtain the raw contig sequences.
[Figure: panels a, b, c, d]

14 Scaffold building by contig graph
- Reads are mapped onto contigs, and connections between contigs are then established.
- Repeats introduce conflicting information, so repeat contigs are masked during scaffolding.
- Paired-end information from libraries of various insert sizes is used to build the contig graph step by step, from short to long.

15 Gap filling
- Contig N50 is usually short (<3 Kb) but can be improved significantly by gap filling (e.g., >20 Kb).
- Most gaps are repeat-related sequences.
- Reads located in gaps can be collected through their paired ends that map uniquely to a contig.
[Figure: contig1 and contig2 with gap reads between them; K-mer graph local assembly; contig1 and contig2 connected]

16 Panda assembly statistics
Step | Paired-end insert size (bp) | Sequence coverage (X) | Physical coverage (X) | N50 (bp) | N90 (bp) | Total length (bp)
Initial contig | 200~ , ,021,639,596
Scaffold 1 | 200~ ,648 7,780 2,213,848,409
Scaffold 2 | Adding 2K ,150 45,240 2,250,442,210
Scaffold 3 | Adding 5K , ,336 2,297,100,301
Scaffold 4 | Adding 10K 58 1,293 1,281, ,670 2,299,498,912
Final contig | All | 58 | 1,293 | 39,886 | 9,848 | 2,245,302,481
Scaffold N50 doubled by adding longer insert size libraries
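For readers unfamiliar with the N50 and N90 columns: N50 is the length of the sequence at which the running total, taken from longest to shortest, first reaches 50% of the total assembly length (90% for N90). A minimal Python sketch follows; the function name `nxx` and the example lengths are made up for illustration and are not taken from the panda data.

```python
def nxx(lengths, percent=50):
    """Return the Nxx statistic (N50 by default) of a list of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    return 0  # empty input


# Hypothetical contig lengths, total 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nxx(contigs, 50))
print("N90:", nxx(contigs, 90))
```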

17 Sequencing strategy
Insert size | #Library | Effective coverage (X) | Type of sequencing
170/250bp | 2 | 22 | PE100/PE
bp | 2 | 15 | PE100/PE
bp | 2 | 12 | PE100/PE150
2kb | 2 | 10 | PE50/PE90
5kb | 2 | 8 | PE50/PE90
10kb | 2 | 5 | PE50/PE90
20kb | 2 | 3 | PE50/PE90
Total
Note: a larger K-mer size requires more sequencing coverage.


18 Pooling strategy for complicated genome
A pooling strategy reduces the chance that repeats and alleles co-occur within a small pool, while costing less than the BAC-by-BAC strategy. It makes it possible to assemble organisms with:
- A wealth of repetitive sequences with high similarity
- A high level of heterozygosity
- Polyploidy

19 System Requirement for SOAPdenovo
- SOAPdenovo targets large plant and animal genomes sequenced with short reads, although it also works well on bacterial and fungal genomes.
- It runs on 64-bit Linux systems.
- The memory required depends on the genome size, the data quality and the K-mer size. It typically needs 150 GB to assemble a human genome.

20 FASTQ file format
NCGAGAGTTTTTGTTTCTCTCCATTCTCGTTCCCGGACCAGAGCATCCT
+
BMSMNVVXWW\^[VVUU[c c\cc Z_c
NTGTAATTTGTTTCACGACATTTCGTATTTTGGGCGGGAATATTTCTTT
+
BYYYY[[[Z[cYYYccccccccccccccccYUccccYUUccccccccYY
CTTGCAAGGGTGTATATTGTTTGATTATCAACTTCTCAGCATGATGTTA
+
AAGCAAGTCTTAATAGTTATAGCCACCAAGTCCTGTTCAAATCTTTTAC
+
gggggggggggggggggggeggggggggggggggggggegggegggeeg
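A FASTQ record has four lines: a header starting with "@", the sequence, a separator line starting with "+", and a quality string of the same length as the sequence. The sketch below reads such records in Python. The function name `read_fastq` and the read names `read1` and `read2` are hypothetical, and the sketch assumes the simple four-line layout with no line wrapping.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Hypothetical two-record example (names invented, sequences shortened).
example = io.StringIO(
    "@read1\nNTGTAATTTG\n+\nBYYYY[[[Z[\n"
    "@read2\nCTTGCAAGGG\n+\ngggggggggg\n"
)
for name, seq, qual in read_fastq(example):
    print(name, len(seq), seq, qual)
```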

21 Configuration file
SOAPdenovo uses a configuration file to record the information needed for assembly:
- Data file names and paths
- File format of the reads
- Insert sizes of the libraries
- Read length
- Rank, i.e. the order in which paired-end information is used when scaffolding
- Cutoffs used in assembly


23 Configuration file
The assembler accepts FASTA or FASTQ. The mate-pair relationship can be indicated in two ways:
- two sequence files with the reads of each pair in the same order
- two adjacent reads in a single file (FASTA only) belonging to a pair
Single-end reads:
  f=/path/filename (fasta)
  q=/path/filename (fastq)
Paired reads in two FASTA sequence files:
  f1= reads 1
  f2= reads 2
Paired reads in two FASTQ sequence files:
  q1= reads 1
  q2= reads 2
Paired reads in a single FASTA sequence file:
  p=/path/filename (fasta)

24 Configuration file
Rank: the order in which libraries are used during scaffolding. Paired-end libraries are used to make connections between contigs in order of insert size, from small to large. Recommended library ranks:
- 170/200/250 bp: rank 1
- 350/500 bp: rank 2
- 800 bp: rank 3
- 2 Kb: rank 4
- 5 Kb: rank 5
- 10 Kb: rank 6
- 20 Kb: rank
[Figure: N50 (Mb) versus insert-size library (500bp, 800bp, 2kb, 5kb, 10kb, 20kb)]
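The recommendation amounts to binning libraries by insert size and numbering the bins from small to large. A Python sketch of that idea follows. The function name `recommended_rank` is hypothetical, the handling of insert sizes that fall between the listed values is an assumption, and so is the choice to give anything above 10 Kb the next rank after 6.

```python
import bisect

# Upper bound (bp) of each bin listed on the slide: ranks 1 to 6.
BIN_UPPER_BOUNDS = [250, 500, 800, 2000, 5000, 10000]


def recommended_rank(avg_ins):
    """Return a scaffolding rank for a library with the given average insert size."""
    # bisect_left counts how many bin bounds are strictly below avg_ins,
    # so a size equal to a bound stays in that bin.
    return bisect.bisect_left(BIN_UPPER_BOUNDS, avg_ins) + 1


for size in (170, 250, 500, 800, 2000, 5000, 10000):
    print(size, "bp -> rank", recommended_rank(size))
```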

25 Configuration file
asm_flags: determines how a given set of data is used during assembly. The assembly process is divided into three steps: contig building, scaffold construction and gap closure.
- asm_flags=1: data used only in contig building (e.g., 454 or Sanger or merging reads)
- asm_flags=2: data used only in scaffold construction (e.g., mate-pair libraries)
- asm_flags=3: data used in both contig building and scaffold construction (e.g., short-insert libraries)
- asm_flags=4: data used only in gap closure

26 Configuration file
reverse_seq: indicates the orientation of the two reads in a pair (forward-reverse or forward-forward).
- reverse_seq=0: for short insert size libraries (<1 Kb)
- reverse_seq=1: for large insert size libraries (>2 Kb)

27 Commands for SOAPdenovo
A typical way (one-line command):
  ./soapdenovo all -s config_file -K 25 -o output_prefix
Step by step:
  ./soapdenovo pregraph -s config_file -K 25 [-R -d -p] -o output_prefix
  ./soapdenovo contig -g output_prefix [-R -M 1 -D]
  ./soapdenovo map -s config_file -g output_prefix [-p]
  ./soapdenovo scaff -g output_prefix [-F -u -G -p]
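When the step-by-step commands are driven from a script, it can help to build the argument lists in one place. The Python sketch below only assembles and prints the four commands shown above, without the optional flags in brackets. The helper name `soapdenovo_steps` is hypothetical, and the binary path, config file name and prefix are placeholders.

```python
def soapdenovo_steps(binary, config_file, output_prefix, kmer=25):
    """Return the argument lists for the four step-by-step commands."""
    return [
        [binary, "pregraph", "-s", config_file, "-K", str(kmer), "-o", output_prefix],
        [binary, "contig", "-g", output_prefix],
        [binary, "map", "-s", config_file, "-g", output_prefix],
        [binary, "scaff", "-g", output_prefix],
    ]


# Dry run: print the commands instead of executing them.
for argv in soapdenovo_steps("./soapdenovo", "config_file", "output_prefix"):
    print(" ".join(argv))
```

To actually run the steps, each list could be passed to `subprocess.run(argv, check=True)` so that a failing step stops the pipeline.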

28 Options for SOAPdenovo
-s STR  configuration file
-o STR  output files prefix
-g STR  input graph file prefix
-K INT  K-mer size [default 23] (range from 13 to 127)
-p INT  number of threads [default 8]
-R      use reads to solve tiny repeats [default no]
-d INT  remove low-frequency K-mers with frequency no larger than INT [default 0] (minimizes the influence of sequencing errors)
-D INT  remove edges with coverage no larger than INT [default 1] (minimizes the influence of sequencing errors)
-M INT  strength of merging similar sequences during contiging [default 1, min 0, max 3] (deals with heterozygosity)
-F      intra-scaffold gap closure [default no]
-u      un-mask high coverage contigs before scaffolding [default mask]
-G INT  allowed length difference between estimated and filled gap
-L      minimum contig length used for scaffolding

29 Key parameters
How to set the K-mer size (option -K)?
- The program accepts odd numbers from 13 to 127.
- Larger K-mers are expected to resolve more repeats in the genome and make the graph simpler, but they require deeper sequencing and longer reads, and they are more sensitive to sequencing errors and heterozygosity.
- Use a smaller K-mer for heterozygous genomes.
- Use a larger K-mer for genomes with a high proportion of repeats.
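One way to see why a larger K needs deeper sequencing: a read of length L contains only L - K + 1 K-mers, so the expected K-mer depth is the base depth scaled by (L - K + 1) / L. The Python sketch below applies this standard relation. It assumes uniform, error-free coverage, and the function name `kmer_depth` and the example numbers are illustrative only.

```python
def kmer_depth(base_depth, read_length, k):
    """Expected K-mer depth for a given base depth, read length and K."""
    if k > read_length:
        raise ValueError("K cannot exceed the read length")
    # Each read of length L yields L - K + 1 K-mers.
    return base_depth * (read_length - k + 1) / read_length


# Hypothetical example: 30X base depth with 100 bp reads.
for k in (25, 63, 91):
    print(f"K={k}: about {kmer_depth(30, 100, k):.1f}X K-mer depth")
```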

30 Key parameters
Other options: -R, -d, -D, -M
- -R: resolve tiny repeats using reads; useful for genomes with a high proportion of repeats.
- -d: remove low-frequency K-mers, which usually result from sequencing errors.
- -D: delete edges with low coverage; this also helps minimize assembly errors and reduce the complexity of the graph.
- -M: relates to the heterozygosity rate of the genome; it is better set to 3 if heterozygosity is higher than 0.3%.

31 Assembly results: Output files
*.contig  Contig sequences built without using mate-pair information.
*.scafseq  Scaffold sequences, the final output of SOAPdenovo, which can be used for further study.

32 Output files from the command "pregraph"
*.kmerfreq  Each row shows the number of K-mers whose frequency equals the row number.
*.edge  Each record gives the information of an edge in the pre-graph: length, K-mers on both ends, average K-mer coverage, a flag indicating whether the sequence is palindromic (identical to its reverse complement), and the sequence.
*.markonedge & *.path  These two files are used when reads are used to solve small repeats.
*.prearc  Connections between edges, established by the read paths.
*.vertex  K-mers at the ends of edges.
*.pregraphbasic  Basic information about the pre-graph: number of vertices, K value, number of edges, maximum read length, etc.
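The *.kmerfreq description suggests a simple use: locate the main coverage peak of the K-mer frequency histogram. The Python sketch below does this on an in-memory list of rows. It assumes each row holds a single integer count and that row N corresponds to frequency N, as described above; the function name `main_peak` and the example counts are invented for illustration.

```python
def main_peak(rows):
    """Return (valley, peak) frequencies from K-mer frequency rows.

    The valley is the first local minimum after the low-frequency
    (mostly error) K-mers; the peak is the most common frequency
    at or after the valley.
    """
    counts = [int(row.split()[0]) for row in rows if row.strip()]
    valley = 0
    # Walk down the initial slope caused by error K-mers.
    while valley + 1 < len(counts) and counts[valley + 1] < counts[valley]:
        valley += 1
    peak = max(range(valley, len(counts)), key=lambda i: counts[i])
    return valley + 1, peak + 1  # rows are 1-based frequencies


# Hypothetical histogram: row number = K-mer frequency.
rows = ["1000", "300", "80", "60", "90", "200", "350", "300", "150"]
print(main_peak(rows))
```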


38 Output files from the command "contig"
*.contig  Contig information: corresponding edge index, length, K-mer coverage, whether it is a tip, and the sequence. Either a contig or its reverse-complementary counterpart is included. Each reverse-complementary contig index is indicated in the *.ContigIndex file.
*.Arc  Arcs coming out of each edge and their corresponding coverage by reads.
*.updated.edge  Some information for each edge in the graph: length, K-mers at both ends, and the index difference between the reverse-complementary edge and this one.
*.ContigIndex  Each record gives information about a contig in *.contig: its edge index, its length, and the index difference between its reverse-complementary counterpart and itself.


42 Output files from the command "map"
*.pegrads  Information for each clone library: insert size, read index upper bound, rank and pair number cutoff for a reliable link. This file can be revised manually to tune scaffolding.
*.readoncontig  Read locations on contigs. Here contigs are referred to by their edge index. However, about half of them are not listed in the *.contig file because their reverse-complementary counterparts are already included.
*.readingap  This file includes reads that could be located in gaps between contigs. This information is used to close gaps in scaffolds.


45 Output files from the command "scaff"
*.newcontigindex  Contigs are sorted by length before scaffolding, and their new indices are listed in this file. This is useful for matching contigs in *.contig with those in *.links.
*.links  Links between contigs, established by read pairs. New indices are used.
*.scaf_gap  Contigs in gaps, found through the contig graph output by the contiging procedure. New indices are used here.
*.scaf  Contigs for each scaffold: contig index (concordant with the index in *.contig), approximate start position on the scaffold, orientation, contig length, and its links to other contigs.
*.gapseq  Gap sequences between contigs.
*.scafseq  Sequence of each scaffold.


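51 Output files: pregraph.log
[Figure: pregraph.log, annotated "Configuration file" and "Available data"]

52 Output files: contig.log
[Figure: contig.log, annotated "Information of the graph from step 1"]

53 Output files: contig.log
[Figure: contig.log, annotated "Total size of contig" and "Final information of the step 2"]

54 Output files: map.log

55 Output files: scaff.log
[Figure: scaff.log, annotated "Final information of the assembly"]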

56 Test Data for SOAPdenovo
Data:
soapdenovo.test/
  00.bin/
    SOAPdenovo
  01.data/
    illumina_*.fq.gz
    reference.fa.gz
  02.assemble/
    soapdenovo.sh
    soapdenovo.cfg

57 Test Data for SOAPdenovo
soapdenovo.sh
  /path/../soapdenovo pregraph -s soapdenovo.cfg -K 33 -p 2 -R -o Soapdenovo_test >pregraph.log
  /path/../soapdenovo contig -g Soapdenovo_test -M 1 -R >contig.log
  /path/../soapdenovo map -g Soapdenovo_test -s soapdenovo.cfg -p 2 >map.log
  /path/../soapdenovo scaff -g Soapdenovo_test -F -p 2 >scaff.log

58 Test Data for SOAPdenovo
soapdenovo.cfg
max_len=100
[LIB]
name=libaa
avg_ins=500
reverse_seq=0
asm_flags=3
rank=1
q1=../01.data/illumina_100_500_libaa_1.fq.gz
q2=../01.data/illumina_100_500_libaa_2.fq.gz
[LIB]
name=libab
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
q1=../01.data/illumina_50_2000_libab_1.fq.gz
q2=../01.data/illumina_50_2000_libab_2.fq.gz
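Before launching a long run it can be worth sanity-checking the configuration file. Because the [LIB] header repeats, a small hand-written reader is simpler than a generic INI parser. The Python sketch below is one possible approach, not part of SOAPdenovo: the function names `parse_soap_config` and `ranks_follow_insert_size` are hypothetical, and treating lines that start with "#" as comments is an assumption.

```python
def parse_soap_config(text):
    """Split a config into global settings and a list of [LIB] sections."""
    global_settings, libraries = {}, []
    current = global_settings
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        if line == "[LIB]":
            current = {}
            libraries.append(current)
            continue
        key, _, value = line.partition("=")
        current[key.strip()] = value.strip()
    return global_settings, libraries


def ranks_follow_insert_size(libraries):
    """Check that rank never decreases as avg_ins grows (small to large)."""
    ordered = sorted(libraries, key=lambda lib: int(lib["avg_ins"]))
    ranks = [int(lib["rank"]) for lib in ordered]
    return ranks == sorted(ranks)


config_text = """\
max_len=100
[LIB]
name=libaa
avg_ins=500
reverse_seq=0
asm_flags=3
rank=1
[LIB]
name=libab
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
"""
settings, libs = parse_soap_config(config_text)
print(settings, [lib["name"] for lib in libs], ranks_follow_insert_size(libs))
```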

59 Thanks!
