Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.


Richard Adams
5 years ago

In this page you will learn to use the tools of the MAPHiTS suite. A word of advice before starting: rename your results and choose explicit filenames.

MAPHiTS is a pipeline developed for SNP discovery after mapping short reads on a reference genome. The pipeline currently runs with the following public tools: "BWA or Bowtie", "Samtools" and "VarScan".

The input data files are a fasta file for the reference genome (Genome.fasta) and 2 fastq files of short reads sequenced in paired-end mode, corresponding to the forward (SR_1.fastq) and the reverse (SR_2.fastq) sequences.

Import the "input data" into your current history:
Embedded Galaxy Dataset 'Genome.fasta'
Embedded Galaxy Dataset 'SR_2.fastq'
Embedded Galaxy Dataset 'SR_1.fastq'

Rename your datasets (select "Edit Attributes"):
Genome.fasta
SR_1.fastq (1250 sequences) => forward
SR_2.fastq (1250 sequences) => reverse

Data pre-processing

Step1: Remove extra information in each header of the genome fasta file
This URGI tool removes the extra information written in each header of the reference sequences.
Use [URGI-MAPHiTS-PreProcess Tools] => Header Fasta Filter on input file Genome.fasta.
Rename output file: Genome Header Filtered (fasta file)

Step2: Remove duplicates in short reads files
This URGI tool removes duplicated short reads (same sequence) from a fastq file.
Use [URGI-MAPHiTS-PreProcess Tools] => Remove Duplicate Short Reads on input files SR_1.fastq and SR_2.fastq.
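To make Steps 1 and 2 of the pre-processing concrete, here is a minimal Python sketch of what such tools could do. It is not the URGI implementation: the rule "keep only the first word of the header" and the rule "keep the first read seen for each sequence" are assumptions, and the function names are hypothetical.

```python
def filter_fasta_header(line):
    """Keep only the first word of a FASTA header line (assumed rule)."""
    if line.startswith(">"):
        fields = line[1:].split()
        return ">" + fields[0] if fields else ">"
    return line


def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def remove_duplicate_reads(records):
    """Keep the first read seen for each distinct sequence (assumed rule)."""
    seen = set()
    kept = []
    for name, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq, qual))
    return kept


# Toy data, not taken from the tutorial datasets.
fastq_lines = [
    "@read1/1", "ACGTACGT", "+", "IIIIIIII",
    "@read2/1", "ACGTACGT", "+", "IIIIIIII",
    "@read3/1", "TTGCAAGC", "+", "IIIIIIII",
]
unique_reads = remove_duplicate_reads(parse_fastq(fastq_lines))
print(filter_fasta_header(">chr1 Vitis vinifera chromosome 1"))
print([name for name, _, _ in unique_reads])
```

Whether two reads with the same sequence but different qualities should count as duplicates is a design choice; the sketch compares sequences only, as the tool description suggests.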

Rename the Step2 output files:
RemoveDuplicateSR1.fastq (1229 forward sequences)
RemoveDuplicateSR2.fastq (1246 reverse sequences)

Step3: Remove short reads > N %
This URGI tool removes from a fastq file all short reads whose rate of N is greater than a threshold.
Use [URGI-MAPHiTS-PreProcess Tools] => Remove short reads > N% on input files RemoveDuplicateSR1.fastq and RemoveDuplicateSR2.fastq
=> set the maximum % of N authorized per sequence to 25.
Rename output files:
FilterN25% SR1.fastq (1228 forward sequences)
FilterN25% SR2.fastq (1245 reverse sequences)
One sequence was removed from each file:
GGAAATACTAACTANANNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Step4: Remove short reads not paired
This URGI tool removes from the fastq files all short reads whose mate is missing.
Use [URGI-MAPHiTS-PreProcess Tools] => Remove Short Reads Not Paired on input files FilterN25% SR1.fastq and FilterN25% SR2.fastq
Rename output files:
RemoveShortReadsNotPaired SR1 (1224 forward sequences)
RemoveShortReadsNotPaired SR2 (1224 reverse sequences)

* File1: RemoveShortReadsNotPaired SR1
INPUT Short Reads: 1228
REMOVED Short Reads: 4
OUTPUT Short Reads: 1224
* File2: RemoveShortReadsNotPaired SR2
INPUT Short Reads: 1245
REMOVED Short Reads: 21
OUTPUT Short Reads: 1224

Mapping reads with BWA or Bowtie

Step1: Normalize quality scores from different sequencing methods (SOLID, ILLUMINA, 454) with the FASTQ Groomer tool
The FASTQ Groomer tool is used to verify and convert between the known FASTQ variants. The valid FASTQ output created by this tool is accepted by all analysis tools ("NGS: QC and manipulation", mapping tools, ...). For more information, see the FASTQ Groomer tool documentation.
Use [NGS: QC and manipulation] => FastQ Groomer on input SR files RemoveShortReadsNotPaired SR1 and RemoveShortReadsNotPaired SR2
Rename output files:
FASTQ Groomer Illumina on SR1 (1224 forward sequences)
FASTQ Groomer Illumina on SR2 (1224 reverse sequences)

Step2: Mapping reads with BWA
BWA is a fast, lightweight and accurate short read aligner that maps relatively short sequences to a reference genome and allows mismatches and indels. It was developed by Heng Li at the Sanger Institute (Li H. and Durbin R., 2009). Several options can be configured in BWA. One of the most important is how many mismatches are allowed between a read and a potential mapping location for that location to be considered a match. The default is usually summarized as 4% of the read length; strictly, the default value 0.04 of the "bwa aln" -n option is the fraction of missing alignments tolerated given a 2% uniform base error rate.
Use [URGI: MAPHiTS - Tools] => Map with BWA for ILLUMINA
For more information on the BWA parameters see "BWA parameter list" at the bottom of the Galaxy tool page.
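Looking back at pre-processing Steps 3 and 4, the two read filters can be sketched in a few lines of Python. This is only an illustration under assumptions: reads are (name, sequence, quality) tuples, a read is removed when its N rate is strictly greater than the threshold, and mates are assumed to share the same name apart from a trailing /1 or /2. The function names are hypothetical and the URGI tools may work differently.

```python
def n_rate(seq):
    """Percentage of N bases in a sequence."""
    return 100.0 * seq.upper().count("N") / len(seq) if seq else 0.0


def remove_reads_over_n(records, max_n_percent=25.0):
    """Drop reads whose N rate is strictly greater than the threshold."""
    return [rec for rec in records if n_rate(rec[1]) <= max_n_percent]


def mate_key(name):
    """Read name without the /1 or /2 mate suffix (assumed naming scheme)."""
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2]
    return name


def remove_unpaired(forward, reverse):
    """Keep only the reads whose mate is present in the other file."""
    shared = {mate_key(r[0]) for r in forward} & {mate_key(r[0]) for r in reverse}
    kept_forward = [r for r in forward if mate_key(r[0]) in shared]
    kept_reverse = [r for r in reverse if mate_key(r[0]) in shared]
    return kept_forward, kept_reverse


# The first read carries one of the sequences reported as removed in Step3;
# the read names are made up for the example.
forward = [
    ("r1/1", "GGAAATACTAACTANANNNNNNNNNNNNNNNNNNNN", "I" * 36),
    ("r2/1", "ACGTACGTACGT", "I" * 12),
    ("r3/1", "TTGCAAGCTTGA", "I" * 12),
]
reverse = [
    ("r1/2", "ACGTTGCAACGT", "I" * 12),
    ("r2/2", "TGCATGCATGCA", "I" * 12),
]
forward = remove_reads_over_n(forward, 25.0)
forward, reverse = remove_unpaired(forward, reverse)
print([r[0] for r in forward], [r[0] for r in reverse])
```

After both filters the two files contain the same number of reads, which is why the tutorial ends up with 1224 forward and 1224 reverse sequences.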

Rename output file: Map with BWA for Illumina SR1 & SR2 and Genome Header Filtered => SAM file
For more information on the SAM format see the "Output" description at the bottom of the Galaxy tool page.

Step3: Mapping reads with Bowtie
Bowtie is an ultrafast, memory-efficient short read aligner geared toward quickly aligning large sets of short DNA sequences (reads) to large genomes. Bowtie is designed to be extremely fast for sets of short reads where (a) many of the reads have at least one good, valid alignment, (b) many of the reads are relatively high-quality, and (c) the number of alignments reported per read is small (close to 1). Bowtie does not report gapped alignments, i.e. it does not handle insertions/deletions well. It was developed by Ben Langmead and Cole Trapnell (Genome Biology 10:R25).
Use [URGI: MAPHiTS - Tools] => Map with Bowtie for ILLUMINA
For more information on the Bowtie parameters see the documentation on the Galaxy tool page.

Rename output file: Map with Bowtie for Illumina SR1 & SR2 and Genome Header Filtered => SAM file
For more information on the SAM format see the "Output" description at the bottom of the Galaxy tool page.

Step4: Reads mapped/unmapped with BWA
This tool parses SAM datasets using the bitwise flag (the second column of the SAM file). For more information see the documentation on the Galaxy tool page. The SAM flags are explained in the SAM format specification.
#Use [URGI: MAPHiTS - PostProcess Tools] => Filter Sam on bitwise flag values on input file Map with BWA for Illumina SR1 & SR2 and Genome Header Filtered (SAM file)
=> Flag 1 Type: "The read is unmapped" and "Set the states for this flag" is "Yes".
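The bitwise-flag filtering described in Step4 can be illustrated with a short Python sketch. In the SAM format, bit 0x4 of the flag (second column) means "the read is unmapped"; the function name and the toy SAM lines are hypothetical, and this is not the code of the Galaxy tool.

```python
UNMAPPED = 0x4  # SAM flag bit: the read is unmapped


def filter_sam_on_flag(sam_lines, flag_bit, state):
    """Keep alignment lines whose flag bit is set (state=True) or unset (state=False)."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if bool(flag & flag_bit) == state:
            kept.append(line)
    return kept


# Toy SAM lines: only the 11 mandatory columns, made-up values.
sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t10\t60\t8M\t=\t50\t48\tACGTACGT\tIIIIIIII",
    "r1\t147\tchr1\t50\t60\t8M\t=\t10\t-48\tTTGCAAGC\tIIIIIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGTTTTT\tIIIIIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tGGGGACGT\tIIIIIIII",
]
unmapped = filter_sam_on_flag(sam_lines, UNMAPPED, True)   # state "Yes"
mapped = filter_sam_on_flag(sam_lines, UNMAPPED, False)    # state "No"
print(len(unmapped), "unmapped,", len(mapped), "mapped")
```

Setting the state to "Yes" keeps the unmapped reads and setting it to "No" keeps the mapped ones, so the two outputs always add up to the total number of reads.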

Rename the output file of the "Yes" filter: Filter SAM: keep unmapped SR (BWA Mapping) => 224 SR are unmapped.

#Use [URGI: MAPHiTS - PostProcess Tools] => Filter Sam on bitwise flag values on input file Map with BWA for Illumina SR1 & SR2 and Genome Header Filtered (SAM file)
=> Flag 1 Type: "The read is unmapped" and "Set the states for this flag" is "No".
Rename output file: Filter SAM: keep mapped SR (BWA Mapping) => 2224 SR are mapped.

With BWA:
2224 SR are mapped
224 SR are unmapped

Step5: Reads mapped/unmapped with Bowtie
#Use [URGI: MAPHiTS - PostProcess Tools] => Filter Sam on bitwise flag values on input file Map with Bowtie for Illumina SR1 & SR2 and Genome Header Filtered (SAM file)
=> Flag 1 Type: "The read is unmapped" and "Set the states for this flag" is "Yes".
Rename output file: Filter SAM: keep unmapped SR (Bowtie Mapping) => 374 SR are unmapped.

#Use [URGI: MAPHiTS - PostProcess Tools] => Filter Sam on bitwise flag values on input file Map with Bowtie for Illumina SR1 & SR2 and Genome Header Filtered (SAM file)
=> Flag 1 Type: "The read is unmapped" and "Set the states for this flag" is "No".

Rename output file: Filter SAM: keep mapped SR (Bowtie Mapping) => 2074 SR are mapped.

With Bowtie:
2074 SR are mapped
374 SR are unmapped

Step6: Count multiple hits from the results of Bwa
This URGI tool counts multiple hits from the results of Bwa.
Use [URGI: MAPHiTS - PostProcess Tools] => Count multiple hits from the results of Bwa on input file Map with BWA for Illumina SR1 & SR2 and Genome Header Filtered (SAM file)
Rename output file: CountMultipleHitsBwa

Step7: Statistics on SAM/BAM output files
This tool uses the SAMTools toolkit to produce simple statistics on a SAM or BAM file.
Use [URGI-MAPHiTS - Tools] => flagstat provides simple stats on BAM files on input file Map with BWA for Illumina SR1 & SR2 and Genome Header Filtered (SAM file)
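The statistics of Step7 are simple counts over the SAM flags. The following Python sketch tallies the main flagstat categories from a list of flag values; it follows the flag bits of the SAM specification, but it is a simplified illustration (hypothetical function name, toy flags), not the samtools code.

```python
from collections import Counter

# Flag bits defined by the SAM specification.
PAIRED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8
READ1, READ2 = 0x40, 0x80


def flagstat(flags):
    """Tally flagstat-like categories from an iterable of SAM flag values."""
    stats = Counter()
    for flag in flags:
        stats["total"] += 1
        mapped = not (flag & UNMAPPED)
        if mapped:
            stats["mapped"] += 1
        if flag & PAIRED:
            stats["paired in sequencing"] += 1
            if flag & READ1:
                stats["read1"] += 1
            if flag & READ2:
                stats["read2"] += 1
            if mapped and flag & PROPER_PAIR:
                stats["properly paired"] += 1
            if mapped and not (flag & MATE_UNMAPPED):
                stats["with itself and mate mapped"] += 1
            if mapped and flag & MATE_UNMAPPED:
                stats["singletons"] += 1
    return stats


# Toy flags: one proper pair (99, 147), one singleton with its unmapped
# mate (73, 133) and one fully unmapped pair (77, 141).
stats = flagstat([99, 147, 73, 133, 77, 141])
for key in sorted(stats):
    print(key, stats[key])
```

The categories are nested: "properly paired" is a subset of "with itself and mate mapped", and the latter plus the singletons gives the mapped paired reads.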


Rename output file: flagstat on SAM mapping with BWA

FlagStat Results:
2448 in total => Total SR count
0 QC failure
0 duplicates => because of the MAPHiTS preprocess "Remove Duplicate Short Reads"
2224 mapped (90.85%) => Total SR mapped
2448 paired in sequencing => Total mate count (equal to the total SR count because of the MAPHiTS preprocess "Remove Short Reads Not Paired")
1224 read1 (forward sequence)
1224 read2 (reverse sequence) => count Reverse == count Forward
2188 properly paired (89.38%) => count of SR mapped in a proper pair. Proper pair mapping is: --> <--
with itself and mate mapped => count of SR mapped in pair (proper pair + not proper pair), i.e. 2224 mapped - 20 singletons = 2204
2204 - 2188 = 16 SR mapped not in proper pair
20 singletons (0.82%) => a singleton is a SR that is mapped while its mate is not.
14 with mate mapped to a different chr => included in the not proper pair set?
14 with mate mapped to a different chr (mapq>=5)
Total SR Not Mapped = Total SR (2448) - Total SR Mapped (2224) = 224 unmapped

Step8: Convert SAM file to BAM file
This tool uses the SAMTools toolkit to produce an indexed BAM file based on a sorted input SAM file.
Use [URGI: MAPHiTS - Tools] => SAM-to-BAM converts SAM format to BAM format on input file Map with BWA for Illumina SR1 & SR2 and Genome Header Filtered (SAM file)
Rename output file: SAM-to-BAM on Map with BWA (BAM file)
Remark: you can use FlagStat directly on the BAM file SAM-to-BAM on Map with BWA.

SNP Calling

Step1: SNP calling with Mpileup
This tool generates BCF (Binary Call Format) or pileup for one or multiple BAM files.
Remark: the Mpileup output format is: chromosome / coordinate / reference base / number of reads covering this position / alleles seen at that position / base quality for each base
Use [URGI MAPHiTS - Tools] => MPileup SNP and indel caller on input file SAM-to-BAM on Map with BWA (BAM file)
Rename output files:
MPileup on BAM and reference genome
MPileup on BAM and reference genome.log

Step2: Reformat the Mpileup SNP calling file with VarScan
This tool is able to predict SNPs and small indels.
Use [URGI MAPHiTS - Tools] => Varscan: VarScan analysis on input file MPileup on BAM and reference genome
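The six pileup columns listed in the Mpileup remark of Step1 can be read with a few lines of Python. The sketch below splits one pileup line and counts the alleles seen in the read-bases column, using the usual samtools pileup conventions ("." and "," for a reference match, "^" plus one character at a read start, "$" at a read end, "+n"/"-n" followed by n inserted or deleted bases). The function names and the example line are hypothetical, and VarScan's own counting is more elaborate (base quality and strand filters).

```python
import re
from collections import Counter


def parse_pileup_line(line):
    """Split one pileup line into its six standard columns."""
    chrom, pos, ref, depth, bases, quals = line.rstrip("\n").split("\t")[:6]
    return chrom, int(pos), ref, int(depth), bases, quals


def count_alleles(ref, bases):
    """Count the alleles seen in the read-bases column of a pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":            # read start, followed by a mapping-quality character
            i += 2
            continue
        if c in "+-":           # indel: sign, length, then the indel sequence
            match = re.match(r"\d+", bases[i + 1:])
            if match is None:   # malformed entry: skip the sign only
                i += 1
                continue
            i += 1 + len(match.group()) + int(match.group())
            continue
        if c in ".,":           # match to the reference base (forward/reverse strand)
            counts[ref.upper()] += 1
        elif c == "*":          # deleted base
            counts["*"] += 1
        elif c.isalpha():       # mismatch: alternate allele
            counts[c.upper()] += 1
        i += 1                  # "$" and other markers are skipped
    return counts


line = "chr1\t100\tA\t5\t.,^I.Gg$\tIIIII"
chrom, pos, ref, depth, bases, quals = parse_pileup_line(line)
alleles = count_alleles(ref, bases)
print(chrom, pos, dict(alleles))
```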

Rename output files:
VarScan.results
VarScan.resume

Here is a history with these results:
Embedded Galaxy History 'TP_MAPHITS_Part1'

MAPHITS post Process Tools

Import 2 new VarScan datasets:
Embedded Galaxy Dataset 'Vitis1_chr1_VarScan'

Embedded Galaxy Dataset 'Vitis2_chr1_VarScan'

Vitis1_chr1_VarScan
Vitis2_chr1_VarScan

VarScan parameters:
min-coverage: Minimum read depth at a position to make a call: 4 (default: 8)
min-reads: Minimum supporting reads at a position to call variants: 2 (default)
min-base qual: Minimum base quality at a position to count a read: 15 (default)
min-var-freq: Minimum variant allele frequency threshold: 0.01 (default)
min p-value: p-value threshold for calling variants: 99e-02 (default)
min Freq. to call homozygote: 0.75 (default)
Ignore variants with >90% support on one strand: Yes

Step1: Tag and merge multiple VarScan analysis
This URGI tool concatenates several VarScan files and tags their results in a new column.
Use [URGI : MAPHiTS - PostProcess Tools] => Tag and merge multiple VarScan analysis on input files Vitis1_chr1_VarScan and Vitis2_chr1_VarScan
Rename output files:
TagAndMerge_VarScan_Vitis1_Vitis2_tagged
TagAndMerge_VarScan_Vitis1_Vitis2.log

Step2: VarScan Filter
This tool filters the VarScan results by modifying some parameters.
##Use [URGI : MAPHiTS - PostProcess Tools] => VarScan Filter on input file Vitis1_chr1_VarScan
Rename output files:
Vitis1_chr1_VarScan_Filter
Vitis1_chr1_VarScan_Filter.log => number of SNPs filtered (passed filters)
##Use [URGI : MAPHiTS - PostProcess Tools] => VarScan Filter on input file Vitis2_chr1_VarScan
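The "Tag and merge" operation of Step1 can be pictured as a concatenation with one extra column. The Python sketch below assumes tab-separated VarScan files that each start with a header line; the name of the added column ("Tag"), the function name and the toy rows are hypothetical, and the actual URGI tool may lay out its output differently.

```python
def tag_and_merge(tagged_inputs):
    """Concatenate VarScan tables and append a column naming their origin.

    tagged_inputs is a list of (tag, lines) pairs; each list of lines starts
    with a header line, which is written once in the merged output.
    """
    merged = []
    for tag, lines in tagged_inputs:
        if not lines:
            continue
        header, *rows = [line.rstrip("\n") for line in lines]
        if not merged:
            merged.append(header + "\tTag")
        merged.extend(row + "\t" + tag for row in rows)
    return merged


# Toy tables with a reduced, made-up set of columns.
vitis1 = ["Chrom\tPosition\tRef\tVarAllele", "chr1\t120\tA\tG", "chr1\t450\tC\tT"]
vitis2 = ["Chrom\tPosition\tRef\tVarAllele", "chr1\t120\tA\tG"]
merged = tag_and_merge([("Vitis1", vitis1), ("Vitis2", vitis2)])
print("\n".join(merged))
```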

Rename output files:
Vitis2_chr1_VarScan_Filter
Vitis2_chr1_VarScan_Filter.log => number of SNPs filtered (passed filters)

Step3: VarScan Compare
This tool compares two VarScan results files (intersection / merge / unique).
##Intersection: Use [URGI : MAPHiTS - PostProcess Tools] => VarScan Compare two varscan results files (intersect / merge / unique) on input files Vitis1_chr1_VarScan and Vitis2_chr1_VarScan. This step gives the intersection of the SNPs found at the same position on the reference genome.
Rename output files:
VarScanCompare_Vitis1_Vitis2_Intersection => only the lines corresponding to the first input file will be written.
VarScanCompare_Vitis1_Vitis2_Intersection.log

Step4: VarScan to GFF3
This URGI tool converts a VarScan file to a GFF3 file.
Use [URGI : MAPHiTS - PostProcess Tools] => VarScan to GFF3 on input file Vitis1_chr1_VarScan_Filter
Rename output file: Vitis1_chr1_VarScan_FiltertoGFF3

MAPHITS-SNPs Chip Tools

Import a new dataset: Vitis_chr1.fasta (grapevine reference genome).
Embedded Galaxy Dataset 'Vitis_chr1.fasta'

Step1: Filter SNPs on same ref position
This URGI tool selects all the SNPs found at the same reference position in the VarScan file and concatenates the results on a single line for each position.
Use [URGI : MAPHiTS - SNPs Chip Tools] => Filter SNPs on same ref position on input file Vitis1_chr1_VarScan_Filter
Rename output files:
FilterSNPsOnSameReadPosition_Vitis1_VarScan_Filter
FilterSNPsOnSameReadPosition_Vitis1_VarScan_Filter_Concatenated

Step2: Select heterozygous SNPs from concatenated varscan file
This URGI tool filters on the "Variant allele frequency" to select heterozygous SNPs from a concatenated VarScan file.
Use [URGI : MAPHiTS - SNPs Chip Tools] => Select heterozygous SNPs from concatenated varscan file on input file FilterSNPsOnSameReadPosition_Vitis1_VarScan_Filter_Concatenated
Rename output files:
HeteroSNPs_Vitis1
HeteroSNPs_Vitis1.log

Step3: Keep SNPs without other SNPs in an interval
This URGI tool filters a VarScan file using a number of nucleotides (N bases) defined by the user. Every SNP written to the output file is the only SNP in the interval [ SNP position - N ; SNP position + N ]. The VarScan input file must be sorted by reference and position!
Use [URGI : MAPHiTS - SNPs Chip Tools] => Keep SNPs without other SNPs in an interval on input file Vitis1_chr1_VarScan_Filter
Rename output file: SNPsWithoutOtherSNP_Vitis1_chr1_VarScan_filter

Step4: Keep SNPs without N in an interval
This URGI tool filters a VarScan file using a number of nucleotides (N) defined by the user. Every SNP written to the output file has no 'N' base in the interval [ SNP position - N ; SNP position + N ].
Use [URGI : MAPHiTS - SNPs Chip Tools] => Keep SNPs without N in an interval on input file Vitis1_chr1_VarScan_Filter
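The two interval filters of Steps 3 and 4 can be sketched as follows in Python. The sketch assumes SNPs given as (reference, position) pairs sorted by reference and position, 1-based positions, and a reference sequence held in memory as a string; the function names are hypothetical and the URGI tools may handle edge cases differently.

```python
def keep_isolated_snps(snps, n):
    """Keep SNPs with no other SNP within n bases on the same reference.

    snps must be sorted by reference and position, as the tool requires.
    """
    kept = []
    for i, (ref, pos) in enumerate(snps):
        prev_close = (i > 0 and snps[i - 1][0] == ref
                      and pos - snps[i - 1][1] <= n)
        next_close = (i + 1 < len(snps) and snps[i + 1][0] == ref
                      and snps[i + 1][1] - pos <= n)
        if not (prev_close or next_close):
            kept.append((ref, pos))
    return kept


def has_n_in_interval(reference, pos, n):
    """True if the reference holds an 'N' in [pos - n ; pos + n] (1-based pos)."""
    start = max(0, pos - 1 - n)
    return "N" in reference[start:pos + n].upper()


snps = [("chr1", 100), ("chr1", 105), ("chr1", 300), ("chr2", 302)]
isolated = keep_isolated_snps(snps, 10)
print(isolated)
print(has_n_in_interval("ACGTNACGTA", 3, 2))
```

Because the input is sorted, only the previous and the next SNP need to be checked, which is presumably why the tool insists on a sorted VarScan file.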


Rename output file: SNPsWithoutN_Vitis1_chr1_VarScan_filter

Step5: Extract SNPs with flanks
This URGI tool creates a fasta file with the SNPs from the VarScan input file together with their 5' and 3' flanks from the reference genome.
Use [URGI : MAPHiTS - SNPs Chip Tools] => Extract SNPs with flanks on input file SNPsWithoutOtherSNP_Vitis1_chr1_VarScan_filter
Rename output file: ExtractSNPWithFlanks_SNPwithoutOtherSNP_Vitis1_chr1_VarScan_filter

Step6: Filter sequences > N% or GC%
This URGI tool filters fasta sequences on a given percentage of GC or N.
##Use [URGI : MAPHiTS - SNPs Chip Tools] => Filter sequence > N% or GC% on input file ExtractSNPWithFlanks_SNPwithoutOtherSNP_Vitis1_chr1_VarScan_filter
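The flank extraction of Step5 amounts to slicing the reference around each SNP position. The Python sketch below assumes 1-based SNP positions and a reference sequence held as a string; the record name and the bracketed [ref/alt] notation are illustrative assumptions, not the documented output format of the URGI tool.

```python
def extract_flanks(reference, pos, flank):
    """Return (5' flank, reference base, 3' flank) around a 1-based position."""
    idx = pos - 1
    if not 0 <= idx < len(reference):
        raise ValueError("position outside the reference sequence")
    five_prime = reference[max(0, idx - flank):idx]
    three_prime = reference[idx + 1:idx + 1 + flank]
    return five_prime, reference[idx], three_prime


def snp_fasta_record(name, reference, pos, alt, flank):
    """Format one SNP with its flanks as a FASTA record (assumed layout)."""
    five_prime, ref_base, three_prime = extract_flanks(reference, pos, flank)
    return ">{}_{}\n{}[{}/{}]{}".format(
        name, pos, five_prime, ref_base, alt, three_prime)


record = snp_fasta_record("chr1", "ACGTACGTAC", 5, "G", 3)
print(record)
```

Flanks are simply truncated when the SNP lies closer than the flank length to either end of the reference.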

Rename output files:
FilterN_1%.log
FilterN_1%_Fasta -> Lower (sequences with < 1% N)
FilterN_1%_Fasta -> Greater (sequences with > 1% N)

##Use [URGI : MAPHiTS - SNPs Chip Tools] => Filter sequence > N% or GC% on input file ExtractSNPWithFlanks_SNPwithoutOtherSNP_Vitis1_chr1_VarScan_filter
Rename output files:
FilterGC_35%.log
FilterGC_35% Fasta -> Lower (sequences with < 35% GC)
FilterGC_35% Fasta -> Greater (sequences with > 35% GC)

Here is a history with these results:
Embedded Galaxy History 'TP_MAPHITS_Part2'
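The "Lower" and "Greater" outputs of Step6 correspond to a split of the fasta sequences around a percentage threshold. A minimal Python sketch, assuming sequences given as (name, sequence) pairs; the function names are hypothetical, and sending sequences that sit exactly on the threshold to the "Lower" set is an assumption, since the tutorial does not say how the tool treats that case.

```python
def percent_of(seq, letters):
    """Percentage of the bases of seq that belong to letters (e.g. "GC" or "N")."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(seq.count(c) for c in letters) / len(seq)


def split_by_percent(records, letters, threshold):
    """Split (name, sequence) pairs into (lower, greater) around a threshold."""
    lower, greater = [], []
    for name, seq in records:
        if percent_of(seq, letters) > threshold:
            greater.append((name, seq))
        else:
            lower.append((name, seq))  # includes sequences exactly at the threshold
    return lower, greater


# Toy sequences, not taken from the tutorial datasets.
records = [("a", "ACGT"), ("b", "AATT"), ("c", "GGCN")]
gc_lower, gc_greater = split_by_percent(records, "GC", 35.0)
n_lower, n_greater = split_by_percent(records, "N", 1.0)
print([name for name, _ in gc_greater], [name for name, _ in n_greater])
```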

