Sequence Mapping and Assembly

Similar documents
Variant Calling and Filtering for SNPs

PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR

Genomes On The Cloud GotCloud. University of Michigan Center for Statistical Genetics Mary Kate Wing Goo Jun

INTRODUCTION AUX FORMATS DE FICHIERS

PRACTICAL SESSION 8 SEQUENCE-BASED ASSOCIATION, INTERPRETATION, VISUALIZATION USING EPACTS JAN 7 TH, 2014 STOM 2014 WORKSHOP

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

NGS Data Analysis. Roberto Preste

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Variant calling using SAMtools

CORE Year 1 Whole Genome Sequencing Final Data Format Requirements

Galaxy Platform For NGS Data Analyses

DNA Sequencing analysis on Artemis

NGS Data Visualization and Exploration Using IGV

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

SAMtools. SAM BAM. mapping. BAM sort & indexing (ex: IGV) SNP call

Genomic Files. University of Massachusetts Medical School. October, 2014

Helpful Galaxy screencasts are available at:

Exome sequencing. Jong Kyoung Kim

Decrypting your genome data privately in the cloud

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

Introduction to NGS analysis on a Raspberry Pi. Beta version 1.1 (04 June 2013)

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

Variation among genomes

Ensembl RNASeq Practical. Overview

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Analyzing ChIP- Seq Data in Galaxy

Genomic Files. University of Massachusetts Medical School. October, 2015

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits

Falcon Accelerated Genomics Data Analysis Solutions. User Guide

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

Intro to NGS Tutorial

Calling variants in diploid or multiploid genomes

Super-Fast Genome BWA-Bam-Sort on GLAD

Sentieon Documentation

ChIP-seq (NGS) Data Formats

Lecture 12. Short read aligners

Resequencing Analysis. (Pseudomonas aeruginosa MAPO1 ) Sample to Insight

From fastq to vcf. NGG 2016 / Evolutionary Genomics Ari Löytynoja /

Tutorial. Identification of Variants Using GATK. Sample to Insight. November 21, 2017

Galaxy workshop at the Winter School Igor Makunin

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012

Bioinformatics in next generation sequencing projects

High-throughout sequencing and using short-read aligners. Simon Anders

Contact: Raymond Hovey Genomics Center - SFS

AgroMarker Finder manual (1.1)

NGS Data and Sequence Alignment

ELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder

Copyright 2014 Regents of the University of Minnesota

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual

Part 1: How to use IGV to visualize variants

Atlas-SNP2 DOCUMENTATION V1.1 April 26, 2010

MPG NGS workshop I: Quality assessment of SNP calls

Bioinformatics Framework

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

NA12878 Platinum Genome GENALICE MAP Analysis Report

The SAM Format Specification (v1.3-r837)

REPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V.

An Introduction to Linux and Bowtie

Sequence Analysis Pipeline

Mapping NGS reads for genomics studies

elprep: a high- performance tool for preparing SAM/BAM files for variant calling Charlo<e Herzeel (Imec) Pascal Costanza (Intel) July 2014

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

Halvade: scalable sequence analysis with MapReduce

Genome 373: Mapping Short Sequence Reads III. Doug Fowler

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

Tutorial on gene-c ancestry es-ma-on: How to use LASER. Chaolong Wang Sequence Analysis Workshop June University of Michigan

Read Mapping and Variant Calling

Fusion Detection Using QIAseq RNAscan Panels

DRAGEN Bio-IT Platform Enabling the Global Genomic Infrastructure

Sequence mapping and assembly. Alistair Ward - Boston College

Kelly et al. Genome Biology (2015) 16:6 DOI /s x. * Correspondence:

Handling sam and vcf data, quality control

Analyzing massive genomics datasets using Databricks Frank Austin Nothaft,

Merge Conflicts p. 92 More GitHub Workflows: Forking and Pull Requests p. 97 Using Git to Make Life Easier: Working with Past Commits p.

The SAM Format Specification (v1.3 draft)

ChIP-Seq Tutorial on Galaxy

Copyright 2014 Regents of the University of Minnesota

Aeromancer: A Workflow Manager for Large- Scale MapReduce-Based Scientific Workflows

NGS Sequence data. Jason Stajich. UC Riverside. jason.stajich[at]ucr.edu. twitter:hyphaltip stajichlab

Understanding and Pre-processing Raw Illumina Data

GenomeStudio Software Release Notes

RPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts.

Cyverse tutorial 1 Logging in to Cyverse and data management. Open an Internet browser window and navigate to the Cyverse discovery environment:

Cycle «Analyse de données de séquençage à haut-débit» Module 1/5 Analyse ADN. Sophie Gallina CNRS Evo-Eco-Paléo (EEP)

ADNI Sequencing Working Group. Robert C. Green, MD, MPH Andrew J. Saykin, PsyD Arthur Toga, PhD

BaseSpace - MiSeq Reporter Software v2.4 Release Notes

NGS : reads quality control

freebayes in depth: model, filtering, and walkthrough Erik Garrison Wellcome Trust Sanger of Iowa May 19, 2015

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM).

Copy Number Variations Detection - TD. Using Sequenza under Galaxy

NGS Analyses with Galaxy

Step-by-Step Guide to Relatedness and Association Mapping Contents

Reads Alignment and Variant Calling

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

Read Naming Format Specification

Transcription:

Practical Introduction Sequence Mapping and Assembly December 8, 2014 Mary Kate Wing University of Michigan Center for Statistical Genetics

Goals of This Session Learn basics of sequence data file formats FASTQ & BAM Raw sequence reads -> aligned sequences Get ready for variant calling Many methods/pipelines, we cover 1 Evaluate quality of sequence data Visualize sequence data to examine reads aligned to particular genomic positions

Session Design A few intro slides Introduces you to how to do each of the goals Instructions for you to follow Walkthrough of how to produce aligned reads Screenshots with explanations Raise your hand if you have any questions/problems Someone will come help

Raw Sequence Reads (FASTQs) Standard file format from sequencing Sequencing done as series of reads Not associated with a chromosome/position Reads can be in pairs Typically separate file for 1st/2nd in pair http://en.wikipedia.org/wiki/fastq_format

Raw Sequence Reads (FASTQs) 4 lines per read Starts with @ 1) Read Name 2) Sequence Bases 3) + 4) Base Qualities 1) Read Name 2) Sequence Bases 3) + 4) Base Qualities 1) Read Name 2) Sequence Bases 3) + 4) Base Qualities First in pair

Raw Sequence Reads (FASTQs) Base Qualities ASCII quality code for each base 33 + phred scale = 33 + -10log10 e e is estimate probability of an incorrect base Lower qualities: special characters/digits! (Q=0), (Q=1), # (Q=3), + (Q-10), / (Q=14) 0 (Q=15), 5 (Q=20), 9 (Q=24) Higher qualities (>Q30): alphabetic characters : (Q=25),? (Q=30), @ (Q=31) A (Q=32), B (Q=33), G (Q=38) Will be recalibrated in alignment pipeline By sequencing run/fastq pair Become more accurate

Sequence Alignment/Map Format: SAM/BAM Maps read to Chromosome & Position Spec: http://samtools.github.io/hts-specs/samv1.pdf More Info: http://genome.sph.umich.edu/wiki/sam Header lines Each line starts with @ Records One for each sequence read/fastq record FASTQ info PLUS Chr/Pos

SAM/BAM Records Header

SAM/BAM Records Header Chr info

SAM/BAM Records Header Chr info Read Group

SAM/BAM Records Header Chr info Read Group Record 1 Record 2

SAM/BAM Records Header Chr info Read Group Read Name from FASTQ (no @, /1, /2 ) Record 2

SAM/BAM Records Header Chr info Read Group Read Name from FASTQ (no @, /1, /2 ) Record 2 Chromosome/position

SAM/BAM Records Header Chr info Mapping to reference info M: match/mismatch I: insertion, D: deletion Read Group Read Name from FASTQ (no @, /1, /2 ) Record 2 Chromosome/position

SAM/BAM Records Header Chr info Mapping to reference info M: match/mismatch I: insertion, D: deletion Read Group Read Name from FASTQ (no @, /1, /2 ) Record 2 Chromosome/position paired-end, mate chr/pos

SAM/BAM Records Header Chr info Mapping to reference info M: match/mismatch I: insertion, D: deletion Read Group Read Name from FASTQ (no @, /1, /2 ) Sequence (from FASTQ) Chromosome/position paired-end, mate chr/pos

SAM/BAM Records Header Chr info Mapping to reference info M: match/mismatch I: insertion, D: deletion Read Group Read Name from FASTQ (no @, /1, /2 ) Sequence (from FASTQ) Recalibrated Quality Chromosome/position paired-end, mate chr/pos

Viewing SAM/BAM Files Samtools - what we will use today http://samtools.sourceforge.net/ view read group, library, MAPQ >, region tview text alignment viewer - visualize reads by position BamUtil (Michigan tool) http://genome.sph.umich.edu/wiki/bamutil Lot s of SAM/BAM tools Other tools Picard, GATK, BamTools

High Quality BAMs from FASTQs 1. Map FASTQ reads to reference genome Identify most likely chromosome/position for each read 2. Remove Duplicates Remove duplicates so they are not counted multiple times as evidence for variant 3. Recalibrate Base Qualities Improve qualities from sequencer

High Quality BAMs from FASTQs Many tools & best practices to choose from Our solution: Genomes on the Cloud (GotCloud) Sequence analysis pipelines You don t need to know the details of individual components Automates steps for you

What is Genomes On The Cloud? Integrative Seamless Robust Scalable Alignment, QC, Variant Calling, Phasing Requires only simple configuration files..against unexpected failures & stops..to many thousands of genomes GotCloud also provides Set of many useful software tools Software library (C++) for sequence analysis

GotCloud Pipelines Robust parallelization Automatically partitions Alignment: by sample Others: by region Takes advantage of clusters Supports MOSIX, SLURM, SGE, PBS Can setup a cluster on Amazon Amazon Machine Image (AMI) available via GNU make Reliable and fault-tolerant Restart where it stopped

GotCloud Alignment Pipeline Overview From Sequencing INPUTS FASTQ Available for Download Reference Files User Constructed FASTQ List File Config File gotcloud align --conf file.config 1) Alignment: bwa/mosaik 2) BAM Processing: bamutil a) Merge multiple BAMs Makefiles (one per sample) b) Mark duplicates & recalibrate 3) Quality Control a) Visualization of quality measures: QPLOT b) Screen for contamination & sample swap: VerifyBamID OUTPUTS Processed BAM QC Results (one per sample) (one per sample)

What are all of the steps in GotCloud align? For each sample (using BWA): GotCloud Alignment Pipeline Overview Aren t you glad you don t have to worry about and manage each step?

User Created Input: FASTQ List File GotCloud needs to know about each FASTQ Where to find it Sample name Each sample can have multiple FASTQs 1 FASTQ only has a single sample Format Tab delimited Header line One line per single-end One line per paired-end

User Created Input: FASTQ List File * Spacing adjusted for easier reading (just a single tab between columns) Header Row

User Created Input: FASTQ List File * Spacing adjusted for easier reading (just a single tab between columns) Groups all FASTQs for a sample in a single BAM Header Row

User Created Input: FASTQ List File * Spacing adjusted for easier reading (just a single tab between columns) Groups all FASTQs for a sample in a single BAM Header Row Multiple FASTQs for 1 sample

User Created Input: FASTQ List File * Spacing adjusted for easier reading (just a single tab between columns) Header Row. means single-end filename means 2nd in pair Groups all FASTQs for a sample in a single BAM Multiple FASTQs for 1 sample

FASTQ List File: Optional RG Fields Field Name Description Default RGID Read Group ID - unique for each run Derived from first line of FASTQ or incrementing numbers LIBRARY Separates FASTQs for a sample that were prepared separately SAMPLE PLATFORM Sequencing Platform ILLUMINA CENTER Sequencing Center Useful if multiple centers unknown MERGE_NAME Group FASTQs into a single BAM file Only need if want multiple BAMs for a sample SAMPLE

User Created Input: Configuration

User Created Input: Configuration

GotCloud Quality Control: Sample Contamination/Swap (by VerifyBamID) Genotype-free estimate of contamination 0-1 scale, the lower, the better FREEMIX column < 0.03 http://genome.sph.umich.edu/wiki/verifybamid# A_guideline_to_interpret_output_files Estimate of contamination with genotypes 0-1 scale, the lower, the better CHIPMIX column We don t have this in our tutorial

GotCloud Quality Control: Quality Metrics (by QPLOT).stats file contains metrics, including mapping rate, coverage, % high quality bases.r file that generates a.pdf of plots Empirical vs reported Phred score

Try it yourself http://genome.sph.umich.edu/wiki/seqshop: _Sequence_Mapping_and_Assembly_Practical Interested in GotCloud? http://genome.sph.umich.edu/wiki/gotcloud Join the mailing list: http://groups.google.com/group/gotcloud