9/11/14

NGS sequencing read formats
BMMB 852: Applied Bioinformatics
Week 3, Lecture 6
István Albert
Bioinformatics Consulting Center
Penn State, 2014

Reads: short sequences produced by the instrument
  Illumina -> FASTQ format (.fastq or .fq)
  SOLiD -> color space FASTA (.xsq or .csfasta + .qual)
  454 -> standard flowgram format (.sff)

Random DNA fragment sequencing with Illumina
  Fragmentation
  For each fragment -> adapter ligation -> separate by strands -> some pieces get sequenced
  [Figure: forward and reverse strands going into the sequencer]

Single end sequencing
  [Figure: sequencing direction and sequencing reads]

Extending the FASTA format
  The sequences are measurements
  There needs to be a way to associate quality measures with each base
  FASTQ -> .fq, .fastq (FASTA with qualities)

The structure of the FASTQ file
Four lines per FASTQ record
  1. @ indicates the sequence identifier
  2. The sequence content of the read
  3. + optionally repeats the sequence id (often left empty)
  4. Sequence quality string

Encodings
  An encoding is a transformation from one representation to another
  The information is not changed
  The optimization method changes
  e.g. pig latin is a type of encoding
  Paper: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010) 38 (6): 1767-1771.

Encoding
  Ordinal (numerical) value of a character (ord)
  One character -> one byte of space
  ABCa = 4 bytes long
  65 66 67 97 = 11 bytes long
  Good: three characters are turned into one, saves space
  Bad: not readable, hinders understanding
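
The four-line record layout and the ord() idea above can be tried out with a few lines of Python. This is only a minimal sketch: the function names parse_fastq and as_ordinals and the record "read1" are made up for illustration, and the parser assumes every record occupies exactly four lines (no wrapped sequences).

```python
# Hypothetical example record; the identifier and qualities are invented.
EXAMPLE_FASTQ = "@read1\nGATTACA\n+\nIIIIHH5\n"


def parse_fastq(text):
    """Split FASTQ text into (identifier, sequence, quality) tuples.

    Assumption: each record is exactly four lines long.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")

    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError("line 1 of a record must start with '@'")
        if not separator.startswith("+"):
            raise ValueError("line 3 of a record must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("need one quality character per base")
        records.append((header[1:], sequence, quality))
    return records


def as_ordinals(text):
    """Return the ordinal (numerical) value of every character."""
    return [ord(character) for character in text]


print(parse_fastq(EXAMPLE_FASTQ))
print(as_ordinals("ABCa"))                   # [65, 66, 67, 97]
print(len("ABCa"), len("65 66 67 97"))       # 4 11
```

Storing each number as a single character is what keeps the quality line the same length as the sequence line.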

Remapping an encoding
  Problem: only some types of characters can be printed.
  So the encoding must start at a character that can be printed. That won't be character zero, yet it needs to represent zero.
  Say character 'A' has a code of 65. If we were to choose 'A' as the minimum of our scale, then we need to shift the scale by 65.

Quality Scores
  A quality score is a number that usually has limits, from a low (say 0) to a high (say 40)
  A quality score represents an error probability.
  It characterizes a single step of the process and NOT the entire experimental procedure
  Quality scores are used to represent base calling accuracy, alignment accuracy and other probabilities

PHRED Quality Scores
  Connecting a quality score to a probability
  For a quality score Q the error probability is
    $P = 10^{-Q/10}$
  Examples:
    Q = 10 -> $P = 10^{-1}$ = 1/10 = 0.1 => P = 10%
    Q = 40 -> $P = 10^{-4}$ = 1/10000 = 0.0001 => P = 0.01%

There are multiple encodings: shifts
  Illumina used to switch around the encoding every once in a while.
  Finally they settled on Sanger for the encoding and the Phred quality representation, since 2011 or so.
  There are plenty of datasets/tools out there that may use different encodings!
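
The Phred relation between a quality score and an error probability can be checked numerically. The sketch below is illustrative only; the function names are invented here, and the inverse formula Q = -10 log10(P) is simply the relation above solved for Q.

```python
import math


def phred_to_error_prob(quality):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(probability):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(probability)


# Print the probability both as a fraction and as a percentage.
for quality in (10, 20, 30, 40):
    prob = phred_to_error_prob(quality)
    print(f"Q = {quality:2d}  P = {prob:g}  ({prob:.2%})")
```

Every step of 10 in Q divides the error probability by 10, which is why Q = 40 corresponds to one error in 10000 base calls.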

Sanger Encoding (shift by 33)
  Quality values range between 0 and 93
  Start the scale at character 33
  End the scale at character 33 + 93 = 126

Illumina 1.3 encoding (shift by 64)
  (obsolete but still often observed in the wild)
  Quality values range between 0 and 62
  Start the scale at character 64
  End the scale at character 64 + 62 = 126
  (currently most instruments only produce qualities in the range 0 to 40)

FASTQ encoding formats
Understanding encodings
  [Figure: chart of the FASTQ encoding formats]
  If you understand how to read this you'll understand the FASTQ format
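
The two shifts described above differ only in the offset that is added to each quality value before it is turned into a character. A minimal sketch, assuming the offsets 33 and 64 and the upper character limit 126 given on the slides; the function and constant names are hypothetical, not part of any library.

```python
SANGER_OFFSET = 33        # Sanger encoding: qualities 0..93
ILLUMINA_13_OFFSET = 64   # Illumina 1.3 encoding: qualities 0..62
LAST_CHARACTER = 126      # both scales end at character 126


def encode_qualities(scores, offset=SANGER_OFFSET):
    """Turn a list of quality values into a quality string."""
    highest = LAST_CHARACTER - offset
    for score in scores:
        if not 0 <= score <= highest:
            raise ValueError(f"quality {score} is outside 0..{highest}")
    return "".join(chr(score + offset) for score in scores)


def decode_qualities(quality_string, offset=SANGER_OFFSET):
    """Turn a quality string back into a list of quality values."""
    scores = [ord(character) - offset for character in quality_string]
    if any(score < 0 or score > LAST_CHARACTER - offset for score in scores):
        raise ValueError("character outside the scale; wrong offset?")
    return scores


def illumina13_to_sanger(quality_string):
    """Re-encode an Illumina 1.3 quality string with the Sanger shift."""
    scores = decode_qualities(quality_string, ILLUMINA_13_OFFSET)
    return encode_qualities(scores, SANGER_OFFSET)


scores = [2, 15, 38]
print(encode_qualities(scores))                       # #0G
print(encode_qualities(scores, ILLUMINA_13_OFFSET))   # BOf
print(illumina13_to_sanger("BOf"))                    # #0G
```

Decoding a Sanger string with the shift of 64 usually produces negative values, which is one quick hint that the wrong encoding was assumed.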

More information may be present
  Illumina instrumentation-specific information: lane, tile, spot

Illumina FASTQ header format
  De-facto standard for producing sequencing reads. The vast majority of current tools expect this format.
  Storing data in SRA removes the extra header information in the FASTQ record! That is unfortunate: some information is now lost and available only to the original authors!
  1. Instrument name: HWI-ST1342 (unique for every sequencer)
  2. Run id: 96
  3. Flowcell id: H0NP9ADXX (unique for every flowcell)
  4. Flowcell lane: 2
  5. Tile number within the flowcell: 1115
  6. X-coordinate of the cluster in the tile: 13393
  7. Y-coordinate of the cluster in the tile: 59201
  More fields may also be present (not shown above):
  1. Mate pair 1 or 2
  2. Flag: Y or N
  Control bits, index sequences, usually defined in the Illumina manuals

Homework 6
  What characters in the Sanger encoding represent base calling error probabilities of:
    100
    0.01
    0.001
  Create a Sanger encoded FASTQ file that contains a single record with the sequence atgc and has the qualities of 40, 35, 36 and 32
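
As a sketch of the Illumina FASTQ header format described earlier in this lecture, the seven listed fields can be pulled apart as follows. The assumptions here: the fields are separated by colons, and any extra fields (mate pair, flag and so on) follow after a space and are ignored. The names IlluminaHeader and parse_illumina_header are made up for this example.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_header(header):
    """Parse the first seven colon-separated fields of an Illumina header.

    Assumption: anything after the first whitespace holds the optional
    extra fields and is not interpreted here.
    """
    if not header.startswith("@"):
        raise ValueError("a FASTQ header must start with '@'")
    chunks = header[1:].split()
    if not chunks:
        raise ValueError("empty header")
    fields = chunks[0].split(":")
    if len(fields) != 7:
        raise ValueError("expected seven colon-separated fields")
    instrument, run_id, flowcell_id, lane, tile, x, y = fields
    return IlluminaHeader(instrument, int(run_id), flowcell_id,
                          int(lane), int(tile), int(x), int(y))


# Header assembled from the field values listed in the lecture.
header = parse_illumina_header("@HWI-ST1342:96:H0NP9ADXX:2:1115:13393:59201")
print(header.flowcell_id, header.lane, header.tile)
```

Reads downloaded from SRA typically no longer carry these fields, so a parser like this should be expected to raise on such headers.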