Image Analysis and Base Calling
Sarah Reid, FAS


For Research Use Only. Not for use in diagnostic procedures.
© 2016 Illumina, Inc. All rights reserved. Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the US and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Learning Objectives
By the end of this lesson, you will be able to:
- Define the components of primary data analysis
  - Control Software
  - Real-Time Analysis
- Describe the image analysis steps
- Describe the base calling and filtering processes

Analysis Overview
Analysis type, software, and outputs:
- Control Software: Images, Intensities and Base Calls
- Analysis Software: Alignments, Variant Detection
- Visualization Software: Annotation, Filtering, Reports

Memory Based RTA2 Processing
- Drastic increase in speed
- Most data stored in RAM
- RTA2 does not display in a command prompt window
  - Runs in background
- If issues are encountered during run:
  - Re-hyb before turnaround is possible
  - Low risk of Read 2 failures
  - RTA cannot be restarted

Primary Data Analysis Workflow
- Image Analysis
  - Template Generation
  - Intensity Extraction
  - Intensity Normalization
- Base Calling
  - Phasing Estimate
  - Base Calling and Filtering
  - Quality Scoring

What is a Cluster?
- Clusters are bright spots on an image
- Each cluster represents thousands of copies of the same DNA strand in a 1-2 micron spot

2-Channel Sequencing: Imaging Cycles
- 2-channel sequencing requires only 2 images per cycle, so the data for all 4 DNA bases are encoded in these 2 images.
- It uses the same sequencing by synthesis (SBS) method as 4-channel sequencing but acquires the data more efficiently.
- Used by the NextSeq Series and MiniSeq

Base and channel:
- T: Green
- C: Red
- A: Green and Red
- G: Dark (neither)

Image Analysis Overview
- Input: images or image data, RunInfo.xml
- Image Analysis
  - Locate clusters: Template Generation
  - Correct signal: Extract Intensities, Normalize Intensities
- Output: .clocs, .cif or intensity data

Locating Clusters: Template Generation
- HiSeq 2500, NextSeq, MiniSeq, MiSeq
- Identify the location of every cluster and create a map
Image source: www.lookandlearn.com/blog/5793/constellations-mapping-stars/

Template Generation: NextSeq and MiniSeq
- Template generation cycles: 5
- Example bases imaged in the red and green channels over cycles 1-5: C, T, A, G, A
- Instrument pauses while the template map is calculated
- At least 1 non-G base is required

Multi-Cycle Detection of Cluster Positions
- Cycle 1 (red channel): neighboring clusters are difficult to resolve when they contain the same base
- Cycle 2 (green channel, red channel): using multiple cycles of sequencing increases the chance that neighboring clusters contain different bases, which helps to resolve clusters

Intensity Extraction: Background Subtraction
1. Compute background
2. Compute signal for each cluster
3. Subtract background from each cluster

- Clusters are not perfect circles. RTA sees them as shown in the slide image.
- Background is calculated by averaging the intensity of the dimmest pixels in a region
- Cluster intensity is extracted from the brightest part of each cluster
- Background is subtracted from the signal of each cluster
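The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not RTA's implementation: the function name, the `dim_fraction` parameter (what share of pixels counts as "dimmest"), and the sample pixel values are all assumptions made for the example.

```python
def subtract_background(region_pixels, cluster_signals, dim_fraction=0.25):
    """Estimate background from the dimmest pixels and subtract it.

    dim_fraction is a hypothetical parameter: the share of pixels in the
    region that is treated as background. The slides do not give a value.
    """
    if not region_pixels:
        raise ValueError("region_pixels must not be empty")
    ranked = sorted(region_pixels)
    n_dim = max(1, int(len(ranked) * dim_fraction))
    # Step 1: background is the mean intensity of the dimmest pixels.
    background = sum(ranked[:n_dim]) / n_dim
    # Steps 2-3: each cluster signal has the background removed.
    corrected = [signal - background for signal in cluster_signals]
    return background, corrected


# Made-up pixel values for one region and two cluster signals.
pixels = [10, 12, 11, 13, 200, 220, 180, 150]
background, corrected = subtract_background(pixels, [220, 180])
print(background, corrected)
```

If a region has no truly dark pixels, the same arithmetic ends up removing real signal, which is the over-clustering problem described on the next slide.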

Over Clustering Affects Background Subtraction
- When a flow cell is over clustered, RTA cannot identify pixels that represent true background
- Instead of subtracting background, RTA subtracts signal. This results in low intensities

Spatial Normalization
- Accounts for bright and dark areas of the image
- Matches the 95th percentiles (P95) of extracted intensities in regions

Ranked Area 1 intensities: 100 98 94 90 89 89 87 85 82 79 74 67 66 65 55 44 23 20 10 4
Ranked Area 2 intensities: 90 89 89 77 75 72 69 64 57 56 55 45 44 23 20 18 14 12 9 3
Normalized Area 2 intensities: 99 98 98 85 84 79 76 71 63 62 61 50 48 25 22 20 15 13 10 3
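One way to read "matching P95" is as a single scale factor applied to a region. The sketch below assumes a nearest-rank percentile and a simple multiplicative scaling; both are assumptions, since the slides do not state the percentile convention or the exact correction RTA applies. The function names are hypothetical. The input values are the Area 1 and Area 2 intensities from the slide.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (assumed convention)."""
    ranked = sorted(values)
    rank = max(1, (pct * len(ranked) + 99) // 100)  # integer ceiling of pct% of n
    return ranked[rank - 1]


def match_p95(reference, target):
    """Scale target so that its P95 equals the P95 of reference."""
    scale = percentile_nearest_rank(reference, 95) / percentile_nearest_rank(target, 95)
    return [round(value * scale) for value in target]


area_1 = [100, 98, 94, 90, 89, 89, 87, 85, 82, 79, 74, 67, 66, 65, 55, 44, 23, 20, 10, 4]
area_2 = [90, 89, 89, 77, 75, 72, 69, 64, 57, 56, 55, 45, 44, 23, 20, 18, 14, 12, 9, 3]

normalized_area_2 = match_p95(area_1, area_2)
print(normalized_area_2)
```

Under these assumptions the P95 values are 98 (Area 1) and 89 (Area 2), a factor of about 1.1. Most results agree with the slide's normalized values; a couple differ by one count, which suggests the slide figures are illustrative.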

2-Color Base Calling Normalization
- Scale all intensities so that their P05 and P95 intensities represent 0 and 1 (P05 = 0, P95 = 1)

Background-subtracted, spatially normalized intensities: 99 98 97 85 84 79 76 71 63 62 61 50 48 25 22 20 15 13 10 3
Base call normalized intensities: 1 1 0.99 0.87 0.86 0.89 0.76 0.72 0.69 0.63 0.62 0.51 0.49 0.25 0.22 0.20 0.15 0.13 0 0
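A minimal sketch of this scaling follows, assuming a linear map from [P05, P95] onto [0, 1] with values outside that range clipped. The clipping and the choice of P05 = 10 and P95 = 98 are read off the slide's example (where 10 and 3 both map to 0, and 99 and 98 both map to 1); they are assumptions, and the function name is hypothetical. The slide's intermediate values appear to be illustrative, so this sketch will not reproduce every one of them.

```python
def scale_to_unit(value, p05, p95):
    """Linearly map p05 -> 0 and p95 -> 1, clipping anything outside the range."""
    if p95 <= p05:
        raise ValueError("p95 must be greater than p05")
    scaled = (value - p05) / (p95 - p05)
    return min(1.0, max(0.0, scaled))


intensities = [99, 98, 97, 85, 84, 79, 76, 71, 63, 62,
               61, 50, 48, 25, 22, 20, 15, 13, 10, 3]

# P05 and P95 as implied by the slide's example (assumed, not computed by RTA here).
normalized = [round(scale_to_unit(v, p05=10, p95=98), 2) for v in intensities]
print(normalized)
```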

Base Calling

Base Calling Input and Output
- Input: .clocs, .cif or intensity data
- Base Calling
  - Phasing & Prephasing Estimate
  - Base Calling and Filtering
  - Quality Scoring
- Output: .bcl (one .bcl for each tile or lane at each cycle)

Base Calling: Phasing Correction
- Phasing and prephasing occur within a single cluster of thousands of strands
[Figure: strands within one cluster illustrating phasing and prephasing; bases shown: T A A A A A C A]

Empirical Phasing Correction
- Phasing correction parameter: how much signal of the previous base is present?
- Prephasing correction parameter: how much signal of the next base is present?
- Cycle N-1, Cycle N, Cycle N+1: subtract the phasing and prephasing parameters to obtain the phase-corrected Cycle N
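The subtraction described above can be illustrated with a simplified first-order model: the corrected signal at cycle N is the observed signal minus a fraction of the cycle N-1 signal (phasing) and a fraction of the cycle N+1 signal (prephasing). This is a sketch under that assumption only; the slides do not give RTA's actual formula, and the function name and sample numbers are hypothetical.

```python
def phase_correct(intensities, phasing, prephasing):
    """Subtract assumed leak-through from the previous and next cycles.

    intensities: one channel's signal for a single cluster, ordered by cycle.
    phasing / prephasing: fractions of the neighboring cycles' signal
    assumed to be present in the current cycle.
    """
    corrected = []
    for n, current in enumerate(intensities):
        previous = intensities[n - 1] if n > 0 else 0.0
        following = intensities[n + 1] if n + 1 < len(intensities) else 0.0
        corrected.append(current - phasing * previous - prephasing * following)
    return corrected


# Made-up example: the middle cycle should be dark but shows 0.15 of leaked signal.
observed = [1.0, 0.15, 1.0]
corrected = phase_correct(observed, phasing=0.10, prephasing=0.05)
print(corrected)
```

In this example the middle cycle's residual signal is fully explained by its neighbors, so it corrects to approximately zero.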

2-Color Population-Based Base Calling
- A scatterplot of 4 distinct populations (nucleotides) is created
- Base calls are made according to which channel is on (1) or off (0) for each cluster, according to (x, y) = (red, green):
  - (1, 0): C
  - (0, 1): T
  - (1, 1): A
  - (0, 0): G
[Figure: scatterplot of red intensity (x-axis, 0 to 1) against green intensity (y-axis, 0 to 1) with the base populations labeled]
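The on/off mapping above can be written as a small lookup. The sketch below uses a fixed threshold to decide whether a channel is "on"; that threshold is an assumption made for illustration, since the slide describes population-based calling rather than a fixed cutoff. The names are hypothetical.

```python
# (red_on, green_on) -> base, as listed on the slide.
BASE_BY_CHANNEL_STATE = {
    (1, 0): "C",
    (0, 1): "T",
    (1, 1): "A",
    (0, 0): "G",
}


def call_base(red, green, threshold=0.5):
    """Call a base from normalized red/green intensities in the range 0..1.

    threshold is a hypothetical cutoff; it stands in for the population
    assignment that the slide describes.
    """
    state = (int(red >= threshold), int(green >= threshold))
    return BASE_BY_CHANNEL_STATE[state]


calls = [call_base(r, g) for r, g in [(0.9, 0.1), (0.1, 0.9), (0.8, 0.9), (0.05, 0.1)]]
print("".join(calls))
```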

2-Color: Calculating Clusters Passing Filter
- Pass filter is based on chastity: $C = 1 - \frac{D_1}{D_1 + D_2}$
- The ratio of the sum of the most prominent and second most prominent population intensities
- Calculated for each cluster over the first 25 bases of the sequence
- Filters clusters by signal purity
- Removes overlapping and low-intensity clusters
- Examples from the slide: $D_1 = 0.2$, $D_2 = 0.8$ gives $C = 0.8$; $D_1 = 0.6$, $D_2 = 0.6$ gives $C = 0.5$
- Passing chastity value: 0.63
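The chastity formula and the two worked examples on the slide can be checked with a short sketch. It assumes that D1 and D2 are the distances from a cluster to its nearest and second-nearest base populations, which is how they appear to be drawn on the slide. The `max_failures` parameter (how many of the first 25 cycles may fall below the threshold) is also an assumption, since the slide does not state it; the function names are hypothetical.

```python
def chastity(d1, d2):
    """Chastity C = 1 - D1 / (D1 + D2) for one cluster at one cycle."""
    if d1 < 0 or d2 < 0 or d1 + d2 == 0:
        raise ValueError("distances must be non-negative and not both zero")
    return 1 - d1 / (d1 + d2)


def passes_filter(chastities, threshold=0.63, cycles=25, max_failures=1):
    """Check the first `cycles` chastity values against the threshold.

    max_failures is an assumed tolerance, not taken from the slide.
    """
    failures = sum(1 for c in chastities[:cycles] if c < threshold)
    return failures <= max_failures


print(chastity(0.2, 0.8))  # slide example with C = 0.8
print(chastity(0.6, 0.6))  # slide example with C = 0.5

clean_cluster = [chastity(0.2, 0.8)] * 25
mixed_cluster = [chastity(0.6, 0.6)] * 25
print(passes_filter(clean_cluster), passes_filter(mixed_cluster))
```

A cluster sitting midway between two populations scores 0.5 at every cycle and is removed, which matches the slide's point that overlapping clusters fail the filter.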

Quality Scoring
- Quality scores: estimate the probability of an error in base calling based on a quality model
- Quality model: includes quality predictors of single bases, neighboring bases, and reads
- Reported after the clusters passing filter calculation has completed (cycle 25)

ASCII   Probability of incorrect base call   Base call accuracy   Q-score
+       1 in 10                              90%                  Q10
5       1 in 100                             99%                  Q20
?       1 in 1000                            99.9%                Q30
I       1 in 10000                           99.99%               Q40
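The table follows the standard Phred relationship, Q = -10 log10(p), where p is the probability of an incorrect base call. The ASCII characters in the table match an offset of 33 (for example, "+" is code 43, giving Q10). The sketch below reproduces the table rows; the function names are hypothetical.

```python
import math


def error_probability(q_score):
    """Probability of an incorrect base call for a given Q-score."""
    return 10 ** (-q_score / 10)


def q_score_from_probability(p_error):
    """Q-score for a given probability of an incorrect base call."""
    return -10 * math.log10(p_error)


def q_score_from_ascii(char, offset=33):
    """Decode one quality character using the offset implied by the table."""
    return ord(char) - offset


for char in "+5?I":
    q = q_score_from_ascii(char)
    p = error_probability(q)
    print(f"{char}  Q{q}  error 1 in {round(1 / p)}  accuracy {100 * (1 - p):.2f}%")
```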

Quality Score Binning
- Store twice the amount of data with the same amount of storage
- Q-scores are binned to reduce FASTQ size and BCL size

Quality score bin compression*
Q-score range   MiniSeq   NextSeq   HiSeq 2500   HiSeq 3000/4000
2-9             2         2         6            7
10-19           14        14        15           11
20-24           21        21        22           22
25-29           28        27        27           27
30-34           32        32        33           32
35-39           37        36        37           37
>40             40        NA        40           43
*Bins are subject to change upon new software release
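Binning amounts to replacing each Q-score with the single representative value for its range. The sketch below uses the NextSeq column of the table as an example; the function and constant names are hypothetical, and because the slide notes that bins can change between software releases, the values should be treated as illustrative.

```python
# (lowest Q, highest Q, binned Q) taken from the NextSeq column of the table.
NEXTSEQ_BINS = [
    (2, 9, 2),
    (10, 19, 14),
    (20, 24, 21),
    (25, 29, 27),
    (30, 34, 32),
    (35, 39, 36),
]


def bin_q_score(q_score, bins=NEXTSEQ_BINS):
    """Return the representative Q-score for the range containing q_score."""
    for low, high, binned in bins:
        if low <= q_score <= high:
            return binned
    raise ValueError(f"no bin defined for Q{q_score}")


raw_scores = [3, 12, 23, 28, 31, 38]
binned_scores = [bin_q_score(q) for q in raw_scores]
print(binned_scores)
```

With only a handful of distinct values left per read, the quality strings compress far better, which is where the storage saving comes from.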

Percent Aligned and Error Rate
- Only calculated if PhiX is spiked in
- % Aligned is the percentage of clusters in which the first 25 cycles align to the PhiX reference genome
- Error rate is the rate of mismatches between the sequencing data and the PhiX reference genome
Image: structure of the PhiX capsid, en.wikipedia.org/wiki/phi_x_174

Detailed NextSeq Data Analysis Workflow
[Figure: timeline across cycles 1, 5, and 25. Labels: Build Ref .locs; Extract 1-5 intensity data; Base Call .bcl; Quality Score .bcl; Align to PhiX; Estimate Phasing/Prephasing; Error Rate; Final Cluster Density Calculated; Clusters PF; Final Cluster Density Reported]

Analysis Overview
Analysis type, software, and outputs:
- Control Software: Images, Intensities and Base Calls
- Analysis Software: Alignments, Variant Detection
- Visualization Software: Annotation, Filtering, Reports

Resources
- Sequencing Analysis Viewer User Guide
- MiSeq User Guide
- NextSeq User Guide
- MiniSeq User Guide
- HiSeq 2500 User Guide
- HiSeq 3000/4000 User Guide
- HiSeq X User Guide

Questions?