GenePool-External : Genome Assembly tutorial for NGS workshop

This page last changed on Oct 11, 2012 by tcezard.
Document generated by Confluence on Oct 12, :09

This is a whole-genome sequencing of an E. coli isolate from the 2011 German outbreak.

You can also find this tutorial online at: +Assembly+tutorial+for+NGS+workshop

1. Download the data from ENA and QC it:

1.1 Download the data

We'll use wget, a tool that can download files directly from the web. This is only one part of an Illumina HiSeq lane.

    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR341/SRR341550/SRR341550_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR341/SRR341550/SRR341550_2.fastq.gz

1.2 QC the reads

    fastqc -t 2 SRR341550_1.fastq.gz SRR341550_2.fastq.gz

How does the quality of the reads look? How many reads do we have? What is the expected depth of coverage, assuming the genome size is 5 Mb?

1.3 Uncompress the fastq files

    gunzip SRR341550_1.fastq.gz
    gunzip SRR341550_2.fastq.gz

2. First pass assembling with velvet with k=41

When you run Velvet, you do so in two stages. First you ask velveth to make the hash table of all the k-mers at your chosen value of k. Then you ask velvetg to build the de Bruijn graph and to output the sequences of the assembled contigs.

Choosing a k-mer size can be a bit tricky and requires trial and error. It depends on the expected depth of coverage and the amount of sequencing error.

2.1 Run velvet

Type velveth to see the expected structure of a velveth command and to identify the flags you should use. There are LOTS of options for velveth (and velvetg).
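The coverage question in section 1.2 and the choice of k are linked, so a small worked calculation may help. The sketch below estimates the nucleotide depth from a read count and converts it to k-mer coverage with the relation given in the Velvet manual, Ck = C * (L - k + 1) / L. The read count and read length are hypothetical placeholders (take the real values from the FastQC report), and the function names are made up for this illustration.

```python
def nucleotide_coverage(n_pairs, read_length, genome_size):
    """Expected depth: total sequenced bases divided by the genome size."""
    return 2 * n_pairs * read_length / genome_size


def kmer_coverage(coverage, read_length, k):
    """Convert nucleotide coverage C to k-mer coverage: Ck = C * (L - k + 1) / L."""
    return coverage * (read_length - k + 1) / read_length


if __name__ == "__main__":
    # Hypothetical inputs, NOT taken from the tutorial data.
    n_pairs = 2_000_000        # number of read pairs reported by FastQC
    read_length = 100          # read length in bases
    genome_size = 5_000_000    # 5 Mb, as assumed in section 1.2

    paired = nucleotide_coverage(n_pairs, read_length, genome_size)
    single = paired / 2        # the first-pass assembly uses the forward reads only
    print(f"paired-end depth: {paired:.1f}x")
    print(f"forward reads only: {single:.1f}x")
    print(f"k-mer coverage at k=41, forward reads only: {kmer_coverage(single, read_length, 41):.1f}x")
```

Velvet reports coverage in k-mer units, so the second number is the one to compare against the values seen in stats.txt.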

    velveth single_41 41 -short -fastq SRR341550_1.fastq

then

    velvetg single_41

2.2 Look at the results

Velvet creates an output file contigs.fa each time... We will first change contigs.fa to a more informative name so that you can track what is going on.

    mv single_41/contigs.fa single_41/k41_nocutoff.contigs.fa

We use a script, contig_stats.pl, written by Sujai Kumar, to check the number of contigs, their span, and additional information such as N50. contig_stats.pl is a simple but very useful utility script. As invoked below, it processes one input fasta file, reports data for all contigs greater than 100 bases (-t 100), and (-h) outputs a "human" readable (rather than comma-separated value) report. If you wanted data at two different minimum cutoffs, you would give more than one number after -t.

    contig_stats.pl -f single_41/k41_nocutoff.contigs.fa -t 100 -h

What is the N50 when using only contigs >100 bases? How many bases are in the assembly? How many contigs have been generated? What do you think of the N50?

2.3: Tuning coverage expectations for Velvet k=41 - finding the most likely coverage

Velvet works best if it is told what to expect in terms of coverage, and when to ignore low and high coverage contigs (errors and repeats respectively). velvetg outputs a file "stats.txt". We can use the statistical programming language R to summarise these data.

    R
    data = read.table("single_41/stats.txt", header=TRUE)
    install.packages("plotrix")
    library(plotrix)
    weighted.hist(data$short1_cov, data$lgth, breaks=0:100, xlab="coverage of node", ylab="frequency, number of nodes with given coverage")
    q()

What is the most likely value for the expected coverage? Where would you set the coverage cut-off?

2.4 Rerunning Velvet with chosen coverage values
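As a cross-check on reading the peak off the R plot, the same length-weighted histogram can be computed directly from stats.txt. The sketch below is an illustration, not part of Velvet: the function name and the min_cov parameter are assumptions, and only the lgth and short1_cov column names come from the R commands in this tutorial. It bins nodes by integer coverage, weights each bin by node length, and returns the heaviest bin.

```python
import io
import math
from collections import Counter


def weighted_coverage_peak(lines, column="short1_cov", min_cov=0.0):
    """Return the integer coverage bin that holds the most node length."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)                      # first non-empty line holds the column names
    cov_idx = header.index(column)
    len_idx = header.index("lgth")
    weight = Counter()
    for fields in rows:
        cov = float(fields[cov_idx])
        # Skip non-finite values and anything below the chosen floor
        # (very low coverage nodes are mostly sequencing errors).
        if math.isfinite(cov) and cov >= min_cov:
            weight[int(cov)] += int(fields[len_idx])
    return max(weight, key=weight.get)


if __name__ == "__main__":
    # Made-up rows in the same whitespace-separated layout, for demonstration only.
    demo = io.StringIO(
        "ID\tlgth\tshort1_cov\n"
        "1\t50\t1.5\n"
        "2\t4000\t24.3\n"
        "3\t6000\t24.9\n"
        "4\t300\t48.0\n"
    )
    peak = weighted_coverage_peak(demo, min_cov=3)
    print("most likely coverage:", peak)
    print("candidate cut-off (half the peak):", peak / 2)
```

To use it on a real run, pass an open file handle, for example open("single_41/stats.txt"). Taking half the peak as the cut-off is only a starting point; it happens to match the ratio of the example values suggested in section 3.4, but the histogram should still be inspected.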

After looking at the stats.txt file (with R), run velvetg again, adding the parameters -exp_cov and -cov_cutoff and replacing XX and YY in the command below with your chosen values.

    velvetg single_41 -exp_cov XX -cov_cutoff YY

then

    mv single_41/contigs.fa single_41/k41_expXXcutYY.contigs.fa

How does the output change? (run contig_stats.pl) How long does it take to run? What do you think of the N50?

2.5 Additional parameter exploration (optional)

You could try using additional advanced options... (-max_coverage)

3. Paired end assembly with Velvet k=41

To do an efficient paired-end assembly we need to know the insert size distribution. This is similar to how you estimated insert size in the variant calling tutorial, but first we need to generate an alignment file. For this we'll use bwa to map the reads back to the best assembly we have.

3.1 Align the reads to the newly created genome

Index the newly assembled genome (replace XX and YY with the values you chose):

    cd single_41
    bwa index -a is k41_expXXcutYY.contigs.fa

This will create several files that bwa will use to quickly look up sequences in the assembled genome.

Then align the reads to the genome. This is done in 3 steps.

First align the forward and reverse reads independently (replace XX and YY with the values you chose):

    mkdir aligned_reads
    bwa aln -t 4 k41_expXXcutYY.contigs.fa ../SRR341550_1.fastq > aligned_reads/SRR341550_1.sai
    bwa aln -t 4 k41_expXXcutYY.contigs.fa ../SRR341550_2.fastq > aligned_reads/SRR341550_2.sai

Put the two alignments together (replace XX and YY with the values you chose):

    bwa sampe -A k41_expXXcutYY.contigs.fa aligned_reads/SRR341550_1.sai aligned_reads/SRR341550_2.sai ../SRR341550_1.fastq ../SRR341550_2.fastq > aligned_reads/SRR341550.sam

The text file in SAM format needs to be converted into binary (replace XX and YY with the values you chose):

    samtools view -bt k41_expXXcutYY.contigs.fa aligned_reads/SRR341550.sam > aligned_reads/SRR341550.bam

Then you need to sort the alignment by coordinates:

    samtools sort aligned_reads/SRR341550.bam aligned_reads/SRR341550_sorted

Here is a trick to do all these steps at the same time (remember to replace XX and YY with the values you chose). It uses a feature of bash called anonymous pipes. Each command in parentheses gets its output "piped" as an anonymous file which the next tool can read. This technique only works if the input files are read once and only once.

    bwa sampe -A k41_expXXcutYY.contigs.fa \
        <(bwa aln -t 4 k41_expXXcutYY.contigs.fa ../SRR341550_1.fastq) \
        <(bwa aln -t 4 k41_expXXcutYY.contigs.fa ../SRR341550_2.fastq) \
        ../SRR341550_1.fastq ../SRR341550_2.fastq \
        | samtools view -bt k41_expXXcutYY.contigs.fa - \
        | samtools sort - aligned_reads/SRR341550_sorted

3.2 Measure insert size distribution

    java -jar $PICARD_PATH/CollectInsertSizeMetrics.jar \
        I=aligned_reads/SRR341550_sorted.bam \
        O=aligned_reads/SRR341550_insert_size.hist \
        H=aligned_reads/SRR341550_insert_size.pdf \
        VALIDATION_STRINGENCY=SILENT

What do you estimate the mean insert size and SD to be?

3.3 Non optimized paired end assembly

We will start off with a simple velvet run (velveth+velvetg) with k-mer 41, using 'shortpaired_41' as the directory parameter. Run these commands from the directory that contains the fastq files.

    velveth shortpaired_41 41 -fastq -shortPaired -separate SRR341550_1.fastq SRR341550_2.fastq
    velvetg shortpaired_41
    mv shortpaired_41/contigs.fa shortpaired_41/k41.pe.contigs.fa

Are there differences in the coverage distribution in paired-end versus single-end assemblies?

    R
    data = read.table("shortpaired_41/stats.txt", header=TRUE)
    library(plotrix)
    weighted.hist(data$short1_cov, data$lgth, breaks=0:50, xlab="coverage of node", ylab="frequency, number of nodes with given coverage")

    q()

What is the most likely value for the expected coverage? How does this compare to your single-end assembly? Where would you set the coverage cut-off for the paired-end assembly?

3.4 Paired end assembly at k=41, with ins_length and other parameters set

Run the small paired set again with velvetg, adding the parameter -ins_length. You should specify the following parameters: -cov_cutoff, -exp_cov and -ins_length. We have indicated some reasonable guesses of these below, but do check your data.

    velvetg shortpaired_41 -exp_cov 120 -cov_cutoff 60 -ins_length 200 -ins_length_sd 25
    mv shortpaired_41/contigs.fa shortpaired_41/k41.pe.exp120.cut60.ins200.sd25.contigs.fa

How does the output change? (run contig_stats.pl, substituting the name of your own single-end assembly for the first file)

    contig_stats.pl -f single_41/k41_exp60cut30.contigs.fa shortpaired_41/k41.pe.exp120.cut60.ins200.sd25.contigs.fa -t 100 -h

Is the paired-end assembly BETTER than the single-end one? Which statistics of the paired-end assembly are better than those of the single-end one? Which statistics are worse?

4. Changing k value

What do you think of the k value we've tried? Is it too small or too big? Try another paired-end assembly with another k value and see how the results change. You'll need to restart from velveth.
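When comparing assemblies across k values it can help to see what the N50 figure actually measures. The sketch below is not contig_stats.pl; it is a minimal illustration of the usual N50 definition (the contig length at which the running total of contigs, sorted longest first, reaches half of the assembly span). The function names are made up here, and the "greater than" threshold follows the wording used for -t earlier in this tutorial, which is an assumption about how the script applies it.

```python
def read_contig_lengths(fasta_lines):
    """Return the sequence length of every record in a FASTA stream."""
    lengths = []
    current = 0
    started = False
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if started:
                lengths.append(current)   # close the previous record
            started = True
            current = 0
        elif started:
            current += len(line)          # sequence may span several lines
    if started:
        lengths.append(current)
    return lengths


def n50(lengths, min_length=0):
    """N50 of the contigs longer than min_length (0 if none are kept)."""
    kept = sorted((n for n in lengths if n > min_length), reverse=True)
    half_span = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half_span:
            return length
    return 0


if __name__ == "__main__":
    # Tiny made-up FASTA, for demonstration only.
    demo = [">c1", "ACGTACGTAC", ">c2", "ACGTACGT", ">c3", "ACGT", "AC", ">c4", "AC"]
    lengths = read_contig_lengths(demo)
    print("contig lengths:", lengths)
    print("N50:", n50(lengths))
```

On real output, pass an open file handle such as open("shortpaired_41/contigs.fa") to read_contig_lengths and use min_length=100 to mirror the -t 100 setting. A higher N50 alone does not make an assembly better, so it is worth reading it alongside the contig count and the total span.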
