Image Analysis and Base Calling
Sarah Reid, FAS

For Research Use Only. Not for use in diagnostic procedures. © 2016 Illumina, Inc. All rights reserved.

Learning Objectives
By the end of this lesson, you will be able to:
- Define the components of primary data analysis
  - Control Software
  - Real-Time Analysis
- Describe the image analysis steps
- Describe the base calling and filtering processes

Analysis Overview
Software and outputs for each analysis type:
- Control Software: Images, Intensities, and Base Calls
- Analysis Software: Alignments, Variant Detection
- Visualization Software: Annotation, Filtering, Reports

Memory Based RTA2 Processing
- Drastic increase in speed
- Most data stored in RAM
- RTA2 does not display in a command prompt window
  - Runs in the background
- If issues are encountered during a run:
  - Re-hyb before turnaround is possible
  - Low risk of Read 2 failures
  - RTA cannot be restarted

Primary Data Analysis Workflow
Image Analysis:
- Template Generation
- Intensity Extraction
- Intensity Normalization
Base Calling:
- Phasing Estimate
- Base Calling and Filtering
- Quality Scoring

What is a Cluster?
- Clusters are bright spots on an image
- Each cluster represents thousands of copies of the same DNA strand in a 1-2 micron spot

2-Channel Sequencing Imaging Cycles
- 2-channel sequencing requires only 2 images, so all the data from the 4 DNA bases are encoded in these 2 images.
- Uses the same sequencing by synthesis (SBS) method as 4-channel sequencing but allows more efficient acquisition of the data.
- Used by the NextSeq Series and MiniSeq

Base   Channel
T      Green
C      Red
A      Green and Red
G      Dark (neither)
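The base-to-channel table above can be written as a small lookup. This is only an illustrative sketch: the names `TWO_CHANNEL_CODE`, `encode_cycle`, and `images_needed` are hypothetical and are not part of any Illumina software.

```python
# (red, green) on/off signal for each base, as listed on the slide.
TWO_CHANNEL_CODE = {
    "C": (1, 0),  # red only
    "T": (0, 1),  # green only
    "A": (1, 1),  # red and green
    "G": (0, 0),  # dark in both images
}


def encode_cycle(bases):
    """Return the (red, green) signal expected for each base in one cycle."""
    return [TWO_CHANNEL_CODE[base] for base in bases]


def images_needed(n_cycles, channels=2):
    """Images acquired per imaged area: one image per channel per cycle."""
    return n_cycles * channels


print(encode_cycle("TCAG"))
print(images_needed(5), "images for 5 cycles with 2 channels")
print(images_needed(5, channels=4), "images for 5 cycles with 4 channels")
```

Because every base maps to a distinct pair of signals, two images per cycle are enough to distinguish all four bases.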

Image Analysis Overview
Input:
- Images or image data
- RunInfo.xml
Image Analysis:
- Template Generation (locate clusters)
- Extract Intensities and Normalize Intensities (correct signal)
Output:
- .clocs
- .cif or intensity data

Locating Clusters
Template Generation (HiSeq 2500, NextSeq, MiniSeq, MiSeq)
- Identify the location of every cluster and create a map
Image: www.lookandlearn.com/blog/5793/constellations-mapping-stars/

Template Generation: NextSeq and MiniSeq
- Template generation cycles: 5
- At least 1 non-G base is required
- The instrument pauses while the template map is calculated
[Figure: red channel and green channel images for cycles 1 to 5, with example bases C, T, A, G, A]

Multi-Cycle Detection of Cluster Positions
- Cycle 1: neighboring clusters are difficult to resolve when they contain the same base
- Cycle 2: using multiple cycles of sequencing increases the chance that neighboring clusters contain different bases, which helps to resolve clusters
[Figure: cycle 1 red channel image; cycle 2 green channel and red channel images]

Intensity Extraction: Background Subtraction
Steps:
- Compute background
- Compute signal for each cluster
- Subtract background from each cluster
Details:
- Clusters are not perfect circles. RTA sees them as shown.
- Background is calculated by averaging the intensity of the dimmest pixels in a region
- Cluster intensity is extracted from the brightest part of each cluster
- Background is subtracted from the signal of each cluster

Over Clustering Affects Background Subtraction
- When a flow cell is over clustered, RTA cannot identify pixels that represent true background
- Instead of subtracting background, RTA subtracts signal
- This results in low intensities
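The background subtraction steps, and the effect of over clustering on them, can be sketched as follows. This is a simplified illustration, not RTA's actual algorithm: the fraction of pixels treated as "dimmest" (`dim_fraction`) and the window size around a cluster (`radius`) are assumed values, and all function names are hypothetical.

```python
def estimate_background(region, dim_fraction=0.25):
    """Average the dimmest pixels in a region (dim_fraction is an assumption)."""
    pixels = sorted(p for row in region for p in row)
    n_dim = max(1, int(len(pixels) * dim_fraction))
    return sum(pixels[:n_dim]) / n_dim


def cluster_signal(region, row, col, radius=1):
    """Take the brightest pixel in a small window around a cluster position."""
    rows = range(max(0, row - radius), min(len(region), row + radius + 1))
    cols = range(max(0, col - radius), min(len(region[0]), col + radius + 1))
    return max(region[r][c] for r in rows for c in cols)


def extract_intensity(region, row, col):
    """Cluster signal minus the regional background estimate."""
    return cluster_signal(region, row, col) - estimate_background(region)


# One cluster surrounded by true background pixels.
sparse = [
    [10, 10, 10, 10],
    [10, 50, 90, 10],
    [10, 40, 60, 10],
    [10, 10, 10, 10],
]
print(extract_intensity(sparse, 1, 2))

# Over-clustered region: even the dimmest pixels carry signal.
dense = [
    [80, 90],
    [85, 95],
]
print(extract_intensity(dense, 1, 1))
```

In the dense region the "background" estimate is itself cluster signal, so the extracted intensity drops sharply even though the raw pixels are bright.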

Spatial Normalization
- Accounts for bright and dark areas of the image
- Matches the 95th percentiles (P95) of extracted intensities in regions

Ranked Area 1 Intensities:
100 98 94 90 89 89 87 85 82 79 74 67 66 65 55 44 23 20 10 4
Ranked Area 2 Intensities:
90 89 89 77 75 72 69 64 57 56 55 45 44 23 20 18 14 12 9 3
Normalized Area 2 Intensities:
99 98 98 85 84 79 76 71 63 62 61 50 48 25 22 20 15 13 10 3
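A minimal sketch of P95 matching, using the Area 1 and Area 2 values from the slide. It assumes a nearest-rank percentile and a single multiplicative scale factor per area; neither choice is confirmed by the slide, and the rounded results agree with the slide's normalized values only approximately. The function names are hypothetical.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling of pct * n / 100
    return ordered[rank - 1]


def match_p95(reference, target):
    """Scale target so that its P95 equals the P95 of reference."""
    scale = percentile_nearest_rank(reference, 95) / percentile_nearest_rank(target, 95)
    return [value * scale for value in target]


area1 = [100, 98, 94, 90, 89, 89, 87, 85, 82, 79,
         74, 67, 66, 65, 55, 44, 23, 20, 10, 4]
area2 = [90, 89, 89, 77, 75, 72, 69, 64, 57, 56,
         55, 45, 44, 23, 20, 18, 14, 12, 9, 3]

normalized = match_p95(area1, area2)
print([round(value) for value in normalized])
```

With 20 values per area, the nearest-rank P95 is the second-highest value (98 for Area 1, 89 for Area 2), so Area 2 is scaled up by a factor of about 1.1.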

2 Color Base Calling Normalization
- Scale all intensities so their P05 and P95 intensities represent 0 and 1 (P05 = 0, P95 = 1)

Background subtracted, Spatial Normalized Intensities:
99 98 97 85 84 79 76 71 63 62 61 50 48 25 22 20 15 13 10 3
Base call Normalized Intensities:
1 1 0.99 0.87 0.86 0.89 0.76 0.72 0.69 0.63 0.62 0.51 0.49 0.25 0.22 0.20 0.15 0.13 0 0
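One way to read "P05 and P95 represent 0 and 1" is a linear rescale with clipping, sketched below. This is an assumption about the mapping, not a description of RTA's implementation; the slide's example values appear to be illustrative and are not reproduced exactly by this formula. The function name `rescale` is hypothetical, and the P05 and P95 values are passed in rather than computed.

```python
def rescale(intensities, p05, p95):
    """Map p05 to 0 and p95 to 1 linearly, clipping values outside that range."""
    span = p95 - p05
    return [min(1.0, max(0.0, (value - p05) / span)) for value in intensities]


# Assumed P05 = 10 and P95 = 98 for this small example.
print(rescale([99, 98, 54, 10, 3], p05=10, p95=98))
```

Values at or above P95 saturate at 1 and values at or below P05 saturate at 0, so each channel ends up on a common 0 to 1 scale before base calling.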

Base Calling

Base Calling Input and Output
Input:
- .clocs
- .cif or intensity data
Base Calling:
- Phasing & Prephasing Estimate
- Base Calling and Filtering
- Quality Scoring
Output:
- .bcl (one .bcl for each tile or lane at each cycle)

Base Calling: Phasing Correction
- Phasing and prephasing occur within a single cluster of thousands of strands
[Figure: strands in one cluster illustrating phasing and prephasing, with base labels T A A A A A C A]

Empirical Phasing Correction
- Phasing correction parameter: how much signal of the previous base is present?
- Prephasing correction parameter: how much signal of the next base is present?
- Using cycles N-1, N, and N+1, subtract the phasing and prephasing contributions to obtain the phase-corrected cycle N
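The subtraction described above can be sketched as a first-order correction. This is a deliberately simplified model under stated assumptions: the phasing and prephasing parameters are treated as fixed fractions of the neighboring cycles' signal, and the function name and example numbers are hypothetical. RTA's empirical correction is more involved.

```python
def phase_correct(prev_cycle, this_cycle, next_cycle, phasing, prephasing):
    """Remove the estimated carry-over from cycles N-1 and N+1 out of cycle N.

    Each cycle is a sequence of per-channel intensities, here (red, green).
    """
    return [
        current - phasing * previous - prephasing * following
        for previous, current, following in zip(prev_cycle, this_cycle, next_cycle)
    ]


# Cycle N-1 was C (red only), cycle N is T (green only), cycle N+1 is G (dark).
# The observed cycle N shows a small red signal lagging from the previous base.
corrected = phase_correct(
    prev_cycle=(1.0, 0.0),
    this_cycle=(0.1, 1.0),
    next_cycle=(0.0, 0.0),
    phasing=0.1,
    prephasing=0.05,
)
print(corrected)
```

After correction the residual red signal is removed, leaving a clean green-only signal for cycle N.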

2 Color Population-based Base Calling
- A scatterplot of 4 distinct populations (nucleotides) is created
- Base calls are made according to which channel is on (1) or off (0) for each cluster, according to (x, y):
  - (1, 0) C
  - (0, 1) T
  - (1, 1) A
  - (0, 0) G
[Figure: scatterplot of Green Intensity (y axis, 0 to 1) against Red Intensity (x axis, 0 to 1) showing the nucleotide populations]
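A minimal sketch of calling bases from normalized (red, green) intensities by assigning each cluster to the nearest population. It assumes fixed, ideal population centers at the four corners listed above; in practice the populations would be estimated from the data, and the names `CENTROIDS`, `call_base`, and `call_read` are hypothetical.

```python
# Ideal (red, green) centers for each population; an assumption for illustration.
CENTROIDS = {
    "C": (1.0, 0.0),
    "T": (0.0, 1.0),
    "A": (1.0, 1.0),
    "G": (0.0, 0.0),
}


def call_base(red, green):
    """Return the base whose population center is closest to (red, green)."""
    def squared_distance(base):
        center_red, center_green = CENTROIDS[base]
        return (red - center_red) ** 2 + (green - center_green) ** 2

    return min(CENTROIDS, key=squared_distance)


def call_read(intensities):
    """Call one base per cycle from a list of (red, green) pairs."""
    return "".join(call_base(red, green) for red, green in intensities)


print(call_read([(0.9, 0.1), (0.05, 0.95), (0.8, 0.9), (0.1, 0.2)]))
```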

2-Color Calculating Clusters Passing Filter
Pass filter is: $C = 1 - \frac{D_1}{D_1 + D_2}$
- The ratio of the sum of the most prominent and second most prominent population intensities
- Calculated for each cluster over the first 25 bases of the sequence
- Filters clusters by signal purity
- Removes overlapping and low-intensity clusters
- Passing chastity value: 0.63
Examples from the figure:
- $D_1 = 0.2$, $D_2 = 0.8$: $C = 0.8$
- $D_1 = 0.6$, $D_2 = 0.6$: $C = 0.5$
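The chastity values in the slide's examples are consistent with C = 1 - D1 / (D1 + D2), which the sketch below implements. Two things here are assumptions rather than statements from the slide: that D1 and D2 are a cluster's distances to its nearest and second-nearest populations, and that a cluster may have at most one cycle below the threshold in the first 25 cycles (`max_failures=1`). The function names are hypothetical.

```python
def chastity(d1, d2):
    """C = 1 - D1 / (D1 + D2); d1 and d2 are assumed to be population distances."""
    return 1.0 - d1 / (d1 + d2)


def passes_filter(chastities, threshold=0.63, cycles=25, max_failures=1):
    """Check chastity over the first cycles; max_failures is an assumed rule."""
    failures = sum(1 for value in chastities[:cycles] if value < threshold)
    return failures <= max_failures


print(chastity(0.2, 0.8))  # cluster close to one population
print(chastity(0.6, 0.6))  # cluster midway between two populations

clean_cluster = [0.8] * 24 + [0.5]
mixed_cluster = [0.5, 0.5] + [0.8] * 23
print(passes_filter(clean_cluster), passes_filter(mixed_cluster))
```

A cluster sitting midway between two populations scores 0.5, below the 0.63 passing value, which is how overlapping clusters get removed.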

Quality Scoring
Quality Scores
- Estimate the probability of an error in base calling based on a quality model
Quality model
- Includes quality predictors of single bases, neighboring bases, and reads
Reported
- After the clusters passing filter calculation has completed (cycle 25)

ASCII Quality Score   Probability of Incorrect Base Call   Base Call Accuracy   Q-score
+                     1 in 10                              90%                  Q10
5                     1 in 100                             99%                  Q20
?                     1 in 1000                            99.9%                Q30
I                     1 in 10000                           99.99%               Q40
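The table rows follow the standard Phred relationship Q = -10 log10(P), with the ASCII character given by the Q-score plus an offset of 33. A short sketch that reproduces the four rows (the function names are hypothetical):

```python
import math


def error_probability(q):
    """Probability of an incorrect base call for a given Q-score."""
    return 10 ** (-q / 10)


def q_score(p_error):
    """Q-score for a given probability of an incorrect base call."""
    return -10 * math.log10(p_error)


def to_ascii(q, offset=33):
    """ASCII character used to store a Q-score (Phred+33 encoding)."""
    return chr(q + offset)


for q in (10, 20, 30, 40):
    accuracy = 100 * (1 - error_probability(q))
    print(to_ascii(q), f"1 in {round(1 / error_probability(q))}", f"{accuracy:.2f}%", f"Q{q}")
```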

Quality Score Binning
- Store twice the amount of data with the same amount of storage
- Q-scores are binned to reduce FASTQ size and BCL size

Quality Score Bin Compression* by instrument:
Q-Score Range   MiniSeq   NextSeq   HiSeq 2500   HiSeq 3000/4000
2-9             2         2         6            7
10-19           14        14        15           11
20-24           21        21        22           22
25-29           28        27        27           27
30-34           32        32        33           32
35-39           37        36        37           37
≥40             40        NA        40           43
*Bins are subject to change upon new software release
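Binning amounts to a range lookup. The sketch below uses the HiSeq 2500 column of the table above, reading the last row as "40 and above"; leaving scores outside the table unchanged is an assumption, and `HISEQ_2500_BINS` and `bin_q` are hypothetical names.

```python
# (low, high, binned value) from the HiSeq 2500 column; high=None means no upper limit.
HISEQ_2500_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
    (40, None, 40),
]


def bin_q(q, bins=HISEQ_2500_BINS):
    """Replace a Q-score with the representative value of its range."""
    for low, high, binned in bins:
        if q >= low and (high is None or q <= high):
            return binned
    return q  # outside the table: left unchanged (assumption)


raw = [8, 12, 23, 31, 38, 41]
print([bin_q(q) for q in raw])
```

Because only a handful of distinct values remain after binning, the quality strings compress far better than the original scores.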

Percent Aligned and Error Rate
- Only calculated if PhiX is spiked in
- % Aligned is the percent of clusters in which the first 25 cycles align to the PhiX reference genome
- Error rate is the rate of mismatches between sequencing data and the PhiX reference genome
Image: Structure of PhiX Capsid, en.wikipedia.org/wiki/phi_x_174
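Both metrics are simple ratios, sketched below. The slide does not say which clusters form the denominator or how alignment is performed, so the inputs here are assumed to be already-counted clusters and an already-aligned read; the function names and example sequences are hypothetical.

```python
def percent_aligned(aligned_clusters, total_clusters):
    """Percent of clusters whose first 25 cycles aligned to PhiX."""
    return 100.0 * aligned_clusters / total_clusters


def error_rate(read, reference):
    """Percent of mismatches between a read and the reference stretch it aligned to."""
    mismatches = sum(1 for called, expected in zip(read, reference) if called != expected)
    return 100.0 * mismatches / len(read)


print(percent_aligned(aligned_clusters=5, total_clusters=500))
print(error_rate("ACGTACGTAC", "ACGTACGAAC"))
```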

Detailed NextSeq Data Analysis Workflow
[Figure: workflow timeline with cycle markers 1, 5, and 25. Labeled steps: Build Ref (.locs); Extract 1-5 Intensity data; Final Cluster Density Calculated; Estimate Phasing/Prephasing; Align to PhiX; Error Rate; Clusters PF; Final Cluster Density Reported; Base Call (.bcl); Quality Score (.bcl)]

Resources
- Sequence Analysis Viewer User Guide
- MiSeq User Guide
- NextSeq User Guide
- MiniSeq User Guide
- HiSeq 2500 User Guide
- HiSeq 3000/4000 User Guide
- HiSeq X User Guide

Questions?