CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

CBSU/3CPG/CVG Joint Workshop Series
Reference genome based sequence variation detection
  Computational Biology Service Unit (CBSU)
  Cornell Center for Comparative and Population Genomics (3CPG)
  Center for Vertebrate Genomics (CVG)

Two different data analysis strategies
  Assembly
  Alignment

De novo Assembly
Reads:
  ACGAGCAACACGGTACCTA
  ACGGTACCTAAACCGG
  TACCTAAACCGGA
  TACCTAAACCGGACCCGGAAAGAC
  ACGAGCAACACGGTAGCTA
  ACGGTAGCTAAACCGG
  TAGCTAAACCGGA
  TAGCTAAACCGGACCCGGAAAGAC
Assembled sequences:
  ...ACGAGCAACACGGTACCTAAACCGGACCCGGAAAGAC...
  ...ACGAGCAACACGGTAGCTAAACCGGACCCGGAAAGAC...

Reference Alignment
Reference genome (base C at the variant site)
Reads aligned to the reference:
  ACGAGCAACACGGTACCTA
  ACGGTACCTAAACCGG
  TACCTAAACCGGA
  TACCTAAACCGGACCCGGAAAGAC
  ACGAGCAACACGGTAGCTA
  ACGGTAGCTAAACCGG
  TAGCTAAACCGGA
  TAGCTAAACCGGACCCGGAAAGAC
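The idea of placing reads on a reference and reading off the differing base can be sketched in a few lines of Python. This is only a naive, ungapped illustration using sequences from the slides; it is not the algorithm BWA uses, and the function name `best_ungapped_hit` is made up for this sketch.

```python
def best_ungapped_hit(reference, read):
    """Return (offset, mismatch_positions) for the placement with the fewest mismatches.

    Positions are 0-based coordinates on the reference. Gaps are not considered.
    """
    best_offset, best_mismatches = None, None
    for offset in range(len(reference) - len(read) + 1):
        mismatches = [
            offset + i
            for i, base in enumerate(read)
            if reference[offset + i] != base
        ]
        # Keep the first placement with the strictly smallest mismatch count.
        if best_mismatches is None or len(mismatches) < len(best_mismatches):
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


# Reference and reads taken from the slides (the C-containing sequence is used as reference).
REFERENCE = "ACGAGCAACACGGTACCTAAACCGGACCCGGAAAGAC"
READS = ["ACGGTACCTAAACCGG", "ACGGTAGCTAAACCGG"]

for read in READS:
    offset, mismatches = best_ungapped_hit(REFERENCE, read)
    for pos in mismatches:
        # Report each mismatch as a candidate variant (1-based position).
        print(f"{read}: ref {REFERENCE[pos]} -> read {read[pos - offset]} at position {pos + 1}")
```

Real aligners index the reference and allow gaps; the brute-force scan here only works for toy sequences.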

With a limited number of individuals, whole genome/exome sequencing does not always reveal the causative mutations

Chr    Position   Ref  Coverage Depth   Genotypes                Gene
chr1   24515167   C    5    11   3      T()     C()     T()
chr1   45396856   G    13   7    9      C()     G()     C()
chr1   68417006   G    43   18   6      A()     G()     A()
chr1   90162621   A    15   99   255    M(AC)   A()     A()
chr1   90162696   G    17   134  255    G()     R(GA)   G()
chr1   90162750   C    19   108  176    Y(CT)   Y(CT)   C()
chr1   90162816   G    30   72   106    G()     K(GT)   K(GT)
chr1   90162975   G    162  48   255    G()     R(GA)   G()
chr1   90163027   C    100  6    255    C()     Y(CT)   Y(CT)
chr1   90163136   A    152  17   176    A()     R(AG)   R(AG)
chr1   90163167   C    132  25   218    C()     M(CA)   M(CA)
chr1   90163191   T    91   19   227    T()     Y(TC)   Y(TC)
chr1   90164490   A    173  16   103    A()     M(AC)   M(AC)
chr1   90164557   A    100  66   137    A()     R(AG)   A()
chr1   90164612   A    62   48   107    A()     R(AG)   R(AG)
chr1   90164677   A    88   37   64     R(AG)   A()     R(AG)
chr1   90165817   T    88   35   56     Y(TC)   Y(TC)   T()
chr17  72952985   C    23   26   31     T()     Y(TC)   T()
chr18  7355152    G    23   34   3      A()     G()     A()
chr18  7355177    A    16   29   3      C()     A()     C()
chr18  25274226   T    28   35   22     C()     Y(CT)   C()
chr18  34475963   A    25   12   25     G(KT)   R(GA)   G()
chr18  38133671   G    69   63   21     C(SG)   G()     G()
chr18  65363507   G    14   29   3      T(KG)   G()     T()
chr18  65363509   T    18   31   3      G(KT)   T()     G()
chr18  71606111   C    9    32   5      A()     C()     A()
chr19  46381078   A    8    12   6      G(RA)   A()     G()

With a limited number of individuals, whole genome/exome sequencing does not always reveal the causative mutations.
Sequence a mapping population.

Reference genome based sequence variation detection
  Step 1: Alignment (FASTQ files -> SAM/BAM files)
  Step 2: Call SNP/INDELs (SAM/BAM files -> VCF file)

Reference genome based sequence variation detection
  Step 3: Filter SNP/INDELs
  Step 4: Annotate SNP/INDELs

Reference genome based sequence variation detection
  Step 1: Alignment
    BWA: Li H. and Durbin R. (2009) Bioinformatics, 25:1754-60
  Step 2: Call SNP/INDELs
    SAMtools: Li H. et al. Bioinformatics, 25:2078-9
    or GATK + Picard: Broad Institute

Reference genome based sequence variation detection
  Step 3: Filtering
    GATK
    Write your own code
  Step 4: Annotation
    Annovar: http://www.openbioinformatics.org/annovar/

Standard file formats
  FASTQ
  SAM/BAM
  VCF

FASTQ file:
@20F75AAXX:5:1:335:1565
ACCTTGTTGAGAAACAGGAGGTGTTGTTCTTCAAAG
+20F75AAXX:5:1:335:1565
]]]]][]][][[][]Z[[[][[[[][[[[][[[[[R
@20F75AAXX:5:1:466:1056
GGAAGCAACAGCTAATACATGAATGGATATCGATCG
+20F75AAXX:5:1:466:1056
[]]]]][]]]Y]]]][Y[[[[[[[[[[Y[Y[YW[[[
@20F75AAXX:5:1:256:1724
GCCCAACAAAGACCGGTCACCAAAGACAGATGATTC
+20F75AAXX:5:1:256:1724
]][]][]][[[[]L[[[[][[[Z[[[[[S[[ZW[[[
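Each FASTQ record spans four lines: the read name, the sequence, a `+` separator, and one quality character per base. The sketch below reads such records and converts quality characters to integer scores. The ASCII offset of 64 is an assumption based on the older Illumina encoding that matches the 0 to 62 range mentioned later in these slides; many current files use an offset of 33 instead. The function names are illustrative only.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the read name
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def decode_qualities(quality, offset=64):
    """Convert quality characters to integer scores (offset 64 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# First record from the slide.
EXAMPLE = """@20F75AAXX:5:1:335:1565
ACCTTGTTGAGAAACAGGAGGTGTTGTTCTTCAAAG
+20F75AAXX:5:1:335:1565
]]]]][]][][[][]Z[[[][[[[][[[[][[[[[R
"""

for name, sequence, quality in read_fastq(io.StringIO(EXAMPLE)):
    scores = decode_qualities(quality)
    print(name, len(sequence), min(scores), max(scores))
```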

SAM file:
HWI-EAS83_20F7TAAXX:1:1:379:338 16 4 157555988 25 36M * 0 0 [SEQ] [QUAL] XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:2C32G0
HWI-EAS83_20F7TAAXX:1:1:582:80 4 * 0 0 * * 0 0 [SEQ] [QUAL]
HWI-EAS83_20F7TAAXX:1:1:98:170 16 4 28122708 37 36M * 0 0 [SEQ] [QUAL] XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:33G2
HWI-EAS83_20F7TAAXX:1:1:169:517 16 3 170277940 25 36M * 0 0 [SEQ] [QUAL] XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0T0C34
([SEQ] and [QUAL] stand for the 36-base read sequence and its quality string, which are scrambled in this transcript.)

Information encoded in SAM file
  Sequence (forward strand of the reference genome)
  Quality score
  Alignment information (position, strand, mismatches, gaps)
  Ambiguous alignments
  Paired-end information
  Read group
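The second column of each SAM record is a bitwise FLAG that carries the strand, pairing, and mapping status. As a small illustration, the sketch below decodes the flag values seen in the example records (16 and 4) using the bit definitions from the SAM specification. The short labels and the function name `decode_flag` are chosen here for readability and are not part of the format.

```python
# FLAG bits as defined in the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "failed quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


for value in (16, 4, 0):
    print(value, decode_flag(value))
```

A flag of 16 therefore marks a read aligned to the reverse strand, and a flag of 4 marks a read that could not be aligned.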

BAM is a compressed SAM file
  A BAM file is several times smaller than the SAM file.
  A BAM file can be indexed and queried.
  Most software operates directly on BAM.
  The BAM format can potentially replace the FASTQ format.

VCF file (variant call format)
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
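A VCF data line has eight fixed, tab-separated columns, followed by a FORMAT column and one column per sample. The sketch below splits one line into its parts, assuming the first data row of the example with `|` as the phased genotype separator. The function names and the layout of the returned dictionary are illustrative choices, not a standard API.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flags such as DB carry no value
    return fields


def parse_vcf_line(line, sample_names):
    """Parse one tab-separated VCF data line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info, fmt = cols[:9]
    keys = fmt.split(":")
    # Trailing per-sample fields may be omitted, so zip() stops at the shorter list.
    samples = {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(sample_names, cols[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": variant_id, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": filt,
        "info": parse_info(info), "samples": samples,
    }


LINE = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
record = parse_vcf_line(LINE, ["NA00001", "NA00002", "NA00003"])
print(record["pos"], record["info"]["DP"], record["samples"]["NA00003"]["GT"])
```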

Alignment with BWA
Commonly used parameters:
  Alignment step (aln):
    -n: maximum edit distance (default 0.04)
    -o: maximum number of gap opens (default 1)
  Write SAM file step (samse or sampe):
    -n: maximum number of alignments to report
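As a rough illustration of how these parameters appear on the command line, the sketch below assembles the two single-end commands (`bwa aln`, then `bwa samse`) as shell strings. It assumes the reference has already been indexed with `bwa index`; the file names and the helper `bwa_single_end_commands` are hypothetical. The commands are only built, not executed.

```python
import shlex


def bwa_single_end_commands(reference, fastq, prefix,
                            max_edit=0.04, max_gap_opens=1, max_hits=None):
    """Build shell command strings for single-end alignment with bwa aln + samse."""
    sai = f"{prefix}.sai"
    sam = f"{prefix}.sam"
    aln = ["bwa", "aln", "-n", str(max_edit), "-o", str(max_gap_opens),
           reference, fastq]
    samse = ["bwa", "samse"]
    if max_hits is not None:
        # -n for samse: maximum number of alignments to report
        samse += ["-n", str(max_hits)]
    samse += [reference, sai, fastq]
    # Both steps write to standard output, so the result is redirected to a file.
    return [
        shlex.join(aln) + " > " + shlex.quote(sai),
        shlex.join(samse) + " > " + shlex.quote(sam),
    ]


for command in bwa_single_end_commands("ref.fa", "reads.fq", "sample1"):
    print(command)
```

For paired-end data the second step would use `sampe` with two `.sai` files and two FASTQ files instead.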

Converting SAM to BAM and indexing BAM

                       SAMtools   Picard
  Convert SAM to BAM   view       SamFormatConverter
  Index BAM            index      BuildBamIndex

*** If you want to use the Broad GATK software to call SNPs, do not use SAMtools; always use Picard for processing SAM and BAM files.
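For the SAMtools column, the conversion and indexing steps might look like the sketch below. It assumes the samtools 1.x command syntax and adds a coordinate sort, which indexing requires but the slide does not list. The file names and the helper `samtools_commands` are hypothetical, and the commands are returned as argument lists rather than run.

```python
def samtools_commands(sam_path, prefix):
    """Return argument lists for SAM -> BAM conversion, sorting, and indexing."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],  # convert SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],      # sort by coordinate
        ["samtools", "index", sorted_bam],                # build the BAM index
    ]


for args in samtools_commands("sample1.sam", "sample1"):
    print(" ".join(args))
```

Each list can be passed to `subprocess.run(args, check=True)` on a machine where samtools is installed.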

BAM file can be visualized with IGV software

Clean up the BAM file
  Mark possible PCR duplicates
  Base quality score recalibration
  Local realignment around indels

Mark possible PCR duplicates: for sequence reads with exactly the same sequence, only one copy is kept.

Base quality score recalibration
  Phred quality score: Q20 corresponds to a 1% error rate.
  Illumina quality scores range from 0 to 62 and need to be recalibrated to reflect the actual error rate.
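The relation between a Phred score Q and the error probability P is Q = -10 * log10(P), which is why Q20 corresponds to a 1% error rate. A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(f"Q{q}: error rate {phred_to_error(q):.4f}")
```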

Multi-sample SNP and INDEL calling
  Use UnifiedGenotyper (GATK) or mpileup (SAMtools) to call SNPs and INDELs from multiple samples.
  Set the variant calling thresholds:
    Emission threshold: Q10 (>10x coverage), Q3 (<10x coverage)
    Confidence threshold: Q30 (>10x coverage), Q4 (<10x coverage)
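The coverage-dependent thresholds can be written as a small lookup. This is only a sketch of the rule on the slide: the function name and dictionary keys are made up, and treating exactly 10x as low coverage is an assumption, since the slide does not say which side of the boundary it falls on.

```python
def calling_thresholds(mean_coverage):
    """Return emission and confidence thresholds (Phred-scaled) for a coverage level."""
    if mean_coverage > 10:
        return {"emit": 10, "call": 30}
    # Assumption: coverage of exactly 10x is handled like low coverage.
    return {"emit": 3, "call": 4}


for coverage in (4, 30):
    print(coverage, calling_thresholds(coverage))
```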

Filtering
  Read depth (DP)
  Allele frequency (AF)
  Number of samples with data (NS)
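If you write your own filtering code, these three criteria can be checked directly on the INFO column of each VCF line. The sketch below is one way to do it; the cutoff values are arbitrary placeholders rather than recommendations, the function name is hypothetical, and a site with no AF value is treated as failing the allele-frequency test.

```python
def passes_filters(info, min_depth=10, min_samples=2, min_af=0.05):
    """Check DP, NS, and AF in a VCF INFO string against minimum cutoffs."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    depth = int(fields.get("DP", 0))
    samples = int(fields.get("NS", 0))
    # AF may hold one value per alternate allele; use the largest.
    frequencies = [float(x) for x in fields.get("AF", "0").split(",")]
    return (depth >= min_depth
            and samples >= min_samples
            and max(frequencies) >= min_af)


for info in ("NS=3;DP=14;AF=0.5;DB;H2", "NS=3;DP=11;AF=0.017"):
    print(info, passes_filters(info))
```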

Steps with SAMtools or GATK/Picard:
  SAM -> BAM
  Flag possible PCR duplicates
  Quality score calibration
  INDEL realignment *
  Call variants on multiple samples
  Filtering **
* SAMtools mpileup has a built-in realignment tool.
** Limited filtering function. Poor documentation.

GATK Documentation: http://www.broadinstitute.org/gsa/wiki/index.php/best_practice_variant_detection_with_the_gatk_v2

SAMtools Variants Calling Documentation: http://samtools.sourceforge.net/mpileup.shtml

Practical aspects
  1. Experimental Design
  2. Computational Resource at Cornell

Whole genome sequencing vs Targeted sequencing
  Target enrichment by array-based or in-solution capture technology (e.g. exome sequencing).

Whole genome sequencing vs Genotyping by Sequencing (GBS)
[Figure: ApeKI sites in Line 1, Line 2, and Line 3]
Ed Buckler Lab (http://www.maizegenetics.net/gbs-overview)

Advantages of GBS over whole genome sequencing
  1. Reduced cost by multiplexing.
  2. Possible to map markers that are not on the reference genome.

To identify causative mutations in a mutant strain, it is necessary to use both sequencing and genetic linkage analysis.

Mapping and Mutation Identification of the Pooled F2 population
[Figure: crossing scheme from the parental cross (X) through F1 to F2, with mutations marked by asterisks]

Using SHOREmap for mapping and mutation identification
  SHOREmap: Schneeberger K. et al. (2009) Nat Methods, 6(8):550-1.

Alternative approach: test for enrichment of new mutations
  Zuryn et al. (2010) A Strategy for Direct Mapping and Identification of Mutations by Whole Genome Sequencing. Genetics 186: 427-430

Computational Resource at Cornell
  CBSU / 3CPG BioHPC Laboratory (625 Rhodes Hall)
  Office hours: 1:00 to 3:00 PM every Monday.
  Email cbsu@cornell.edu to get a BioHPC lab account.

Training workshops
  Linux for Biologists
  Programming workshop (PERL)