Reads Alignment and Variant Calling

Similar documents
RPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts.

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

Analysing re-sequencing samples. Anna Johansson WABI / SciLifeLab

Analysing re-sequencing samples. Malin Larsson WABI / SciLifeLab

Exome sequencing. Jong Kyoung Kim

SweGen: A whole-genome data resource of genetic variability in a cross-section of the Swedish population

Falcon Accelerated Genomics Data Analysis Solutions. User Guide

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder

SNP Calling. Tuesday 4/21/15

Decrypting your genome data privately in the cloud

Sentieon DNA Pipeline for Variant Detection Software-only solution, over 20 faster than GATK 3.3 with identical results

Sentieon DNA pipeline for variant detection - Software-only solution, over 20 faster than GATK 3.3 with identical results

Practical exercises Day 2. Variant Calling

NA12878 Platinum Genome GENALICE MAP Analysis Report

REPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V.

Introduction to NGS analysis on a Raspberry Pi. Beta version 1.1 (04 June 2013)

NGS Sequence data. Jason Stajich. UC Riverside. jason.stajich[at]ucr.edu. twitter:hyphaltip stajichlab

Halvade: scalable sequence analysis with MapReduce

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

CORE Year 1 Whole Genome Sequencing Final Data Format Requirements

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Kelly et al. Genome Biology (2015) 16:6 DOI /s x. * Correspondence:

AgroMarker Finder manual (1.1)

Sentieon Documentation

Mapping NGS reads for genomics studies

Variation among genomes

Galaxy Platform For NGS Data Analyses

MPG NGS workshop I: Quality assessment of SNP calls

Mapping reads to a reference genome

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

NGS Data Analysis. Roberto Preste

Ensembl RNASeq Practical. Overview

INTRODUCTION AUX FORMATS DE FICHIERS

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Resequencing and Mapping. Andreas Gisel Inernational Institute of Tropical Agriculture (IITA) Ibadan, Nigeria

Lecture 12. Short read aligners

ELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018

RNA-seq. Manpreet S. Katari

Sequence Analysis Pipeline

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

Dindel User Guide, version 1.0

NGS Analysis Using Galaxy

NGS Data Visualization and Exploration Using IGV

Handling sam and vcf data, quality control

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

Package HTSeqGenie. April 16, 2019

From fastq to vcf. NGG 2016 / Evolutionary Genomics Ari Löytynoja /

CallHap: A Pipeline for Analysis of Pooled Whole-Genome Haplotypes Last edited: 8/8/2017 By: Brendan Kohrn

Under the Hood of Alignment Algorithms for NGS Researchers

Mapping and Viewing Deep Sequencing Data bowtie2, samtools, igv

Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual

Bioinformatics Framework

3. Installation Download Cpipe and Run Install Script Create an Analysis Profile Create a Batch... 7

Accelrys Pipeline Pilot and HP ProLiant servers

High-throughout sequencing and using short-read aligners. Simon Anders

Maize genome sequence in FASTA format. Gene annotation file in gff format

Genomic Files. University of Massachusetts Medical School. October, 2014

NGS Data and Sequence Alignment

Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study

Hinri Kerstens. NGS pipeline using Broad's Cromwell

Bioinformatics in next generation sequencing projects

Sequence mapping and assembly. Alistair Ward - Boston College

Basics of high- throughput sequencing

Aligners. J Fass 23 August 2017

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

Read mapping with BWA and BOWTIE

DNA Sequencing analysis on Artemis

Local Run Manager Resequencing Analysis Module Workflow Guide

Super-Fast Genome BWA-Bam-Sort on GLAD

Read Mapping and Variant Calling

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

elprep: a high- performance tool for preparing SAM/BAM files for variant calling Charlo<e Herzeel (Imec) Pascal Costanza (Intel) July 2014

Toward High Utilization of Heterogeneous Computing Resources in SNP Detection

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010

Analyzing Variant Call results using EuPathDB Galaxy, Part II

Helpful Galaxy screencasts are available at:

Illumina Next Generation Sequencing Data analysis

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits

halvade Documentation

Genomes On The Cloud GotCloud. University of Michigan Center for Statistical Genetics Mary Kate Wing Goo Jun

Evaluate NimbleGen SeqCap Epi Target Enrichment Data

BWT Indexing: Big Data from Next Generation Sequencing and GPU

A Fast Read Alignment Method based on Seed-and-Vote For Next GenerationSequencing

Aligners. J Fass 21 June 2017

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Bioinformatica e analisi dei genomi

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs)

MiSeq Reporter TruSight Tumor 15 Workflow Guide

Genomic Files. University of Massachusetts Medical School. October, 2015

java -jar picard.jar CollectInsertSizeMetrics I=aln_sorted.bam O=out.metrics HISTOGRAM_FILE=chartoutput.pdf VALIDATION_STRINGENCY=LENIENT

Sequence Mapping and Assembly

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

Evaluate NimbleGen SeqCap RNA Target Enrichment Data

Transcription:

Reads Alignment and Variant Calling CB2-201 Computational Biology and Bioinformatics February 22, 2016 Emidio Capriotti http://biofold.org/ Institute for Mathematical Modeling of Biological Systems Department of Biology

Genome Analysis Alignment is different from Assembly: Alignment aims to find the best matches of a particular read to a reference genome Assembly finds the best overlaps among the reads to determine the most likely genome

Mapping Given the level of variability across individuals it is expected that an high fraction of reads will not map. The complete human pan-genome will require additional 19-40 Mb of novel sequences. Nature Biotechnology 28, 57 63 (2010)

The Alignment The main limitation is the computational time. the best method should have a good balance between cpu and memory usage, speed and accuracy. The most popular methods (BWA and Bowtie) are based on suffix array: a sorted array of all suffixes of a string. BWA and other aligners such as Bowtie use an implementation of the Burrows Wheeler transform algorithm which is a technique for data compression.

Testing methods http://lh3lh3.users.sourceforge.net/alnroc.shtml

Genome Indexing The first step consist in the creation of an indexed reference genome. > bwa index $db.fasta two useful option: -p database prefix -a indexing algorithm bwtsw for large genome (> 50,000,000 BP) is for smaller genomes Reference from different institutions ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/

Other Indexing GATK requires two specific index files with extension fai and dict. The fai file is generated by samtools > samtools faidx $db.fasta The dict file is generated by picard >java -jar CreateSequenceDictionary.jar \ REFERENCE=$db.fasta OUTPUT=$db.dict

Variant Calling Steps Reference Genome SAM Format Marking Duplicates NGS reads BWA or Bowtie BAM Format Sorted BAM Realignment Recalibration Variant Calling Variant Recalibration Reads Alignment and Manipulation BWA, Bowtie, SAMTools Reads filtering Picard, GATK Variant Calling GATK, Freebayes

Alignments of the reads The alignment is generated using bwa aln > bwa aln -t [opts] $db $file.fastq two useful option: -f output file -t number of threads If you are have analyzing pair-end sequencing you need to repeat the alignment for both fastq files.

Generate the SAM file The SAM file is generated using bwa samse (single-end) or sampe (pair-end) > bwa samse $db $file.sai $file.fastq > bwa sampe $db $file1.sai $file2.sai $file1.fastq $file2.fastq two useful option: -f output file -r group_info for example @RG\tID:..\tLB:..\tSM:..\tPL:ILLUMINA

SAM to BAM To save space and make all the process faster we can convert the SAM file to BAM using samtools > samtools view -bs $file.sam > $file.bam useful option: -b output BAM -S input SAM -T reference genome if header is missing

Samtools functions Samtools can be used to perform several tasks: Sorting BAM file > samtools sort $file.bam -o $file.sorted Create an index > samtools index $file.bam $file.bai Filtering out unmapped reads > samtools view -h -F 4 $file.bam $file.mapped.bam Select properly paired reads > samtools view -h -f 0X0002 $file.bam $file.paired.bam

Optical Duplication Optical duplicates are due to a read being read twice. The number of the duplicates depends on the depth of the sequencing, the library and sequencing technology. https://www.broadinstitute.org/gatk/

Picard Duplication can be detect comparing the CIGAR (Short way to represent alignment with reference sequence). Picard is used to mark the duplicated reads. # Mark duplicate java -Xmx4g -Djava.io.tmpdir=/tmp -jar MarkDuplicates.jar \ INPUT=$file.bam OUTPUT=$file.marked.bam METRICS_FILE=metrics \ CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT https://www.broadinstitute.org/gatk/

Indel realignment The mapping of indels especially in regions near to the ends can be seen as mismatches. Consecutive variants close to the ends are suspicious. https://www.broadinstitute.org/gatk/

Better Alignment The realignment around the indels improve the quality of the alignment RealignerTargetCreator IndelRealigner https://www.broadinstitute.org/gatk/

Realignment After marking the duplicated reads GATK the alignment is recalculated to improve the mach to the indels. # Local realignment java -Xmx4g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \ -R $db.fa -o $file.bam.list -I $file.marked.bam java -Xmx4g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar \ -I $file.marked.bam -R $genome -T IndelRealigner \ -targetintervals $file.bam.list -o $file.marked.realigned.bam # Picard realignment java -Djava.io.tmpdir=/tmp/flx-auswerter -jar FixMateInformation.jar \ INPUT=$file.marked.realigned.bam OUTPUT=$file.marked.realigned.fixed.bam \ SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

Quality Score The quality score are critical for the downstream analysis. This score depends on the nucleotide context. https://www.broadinstitute.org/gatk/

The recalibration step After removing the duplicated reads scores need to be recalibrated using GATK. The recalibration is detected calculating the covariation among nucleotide features. # Recalibration java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator \ -R $db.fa -I $file.marked.realigned.fixed.bam -knownsites $dbsnp -o \ $output.recal_data.csv java -jar GenomeAnalysisTK.jar -T PrintReads \ -R $db.fa -I $file.marked.realigned.fixed.bam \ -BQSR $file.recal_data.csv -o $file.marked.realigned.fixed.recal.bam

The Variant Calling There are two possible options: UnifiedGenotyper and HyplotypeCaller # Variant calling java -Xmx4g -jar GenomeAnalysisTK.jar -glm BOTH -R $db.fa \ -T UnifiedGenotyper -I $file.marked.realigned.fixed.recal.bam \ -D $dbsnp -o $file.vcf -metrics snps.metrics \ -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A AlleleBalance or java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $db.fa \ -I $file.marked.realigned.fixed.recal.bam --emitrefconfidence GVCF \ variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp $dbsnp \ -o $output.recalibrated.vcf UnifiedGenotyper is faster and HyplotypeCaller is more accurate on the detection of indels.

Improve Variant Calling The final step consist in recalibration and filtering. The letter are based on previously known mutation events. # Variant Quality Score Recalibration java -jar GenomeAnalysisTK.jar T VariantRecalibrator\ R $db.fa input $file.vcf \ resource: {dbsnp, 1000Genomes, Haplotype } \ an QD an MQ an HaplotypeScore { } \ mode SNP recalfile $file.snps.recal \ tranchesfile $file.recalibrated.tranches java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \ -R $db.fa -input $file.vcf -mode SNP\ -recalfile $file.snps.recal -tranchesfile raw.snps.tranches \ -o $file.recalibrated.vcf -ts_filter_level 99.0 # Variant Filtering java -Xmx4g -jar GenomeAnalysisTK.jar -R $db.fa \ -T VariantFiltration -V $file.recalibrated.vcf \ -o $file.recalibrated.filtered.vcf --clusterwindowsize 10 \ --filterexpression some filter --filtername filter_name

Visualization Broad Institute have developed the Integrative Genome Viewer (IGV) for the visualization of genomic data from different sources of data. http://cmb.path.uab.edu/training/docs/cb2-201-2015/igv_2.3.40.zip

For more details Samtools http://www.htslib.org/doc/ GATK Guide https://www.broadinstitute.org/gatk/guide/ Best practices for variant calling with GATK http://www.broadinstitute.org/partnerships/education/broade/bestpractices-variant-calling-gatk IGV http://www.broadinstitute.org/igv/