NGS sequencing read formats. Random DNA fragment sequencing with Illumina. Extending the FASTA format. 9/11/14

Transcription:

NGS sequencing read formats
BMMB 852: Applied Bioinformatics
Week 3, Lecture 6
István Albert
Bioinformatics Consulting Center
Penn State, 2014

Reads: short sequences produced by the instrument
Illumina -> FASTQ format (.fastq or .fq)
SOLiD -> color space FASTA (.xsq or .csfasta + .qual)
454 -> standard flowgram format (.sff)

Random DNA fragment sequencing with Illumina
Fragmentation
For each fragment -> adapter ligation -> separate by strands -> some pieces get sequenced
[Figure labels: Forward, Reverse, Sequencer]

Extending the FASTA format
The sequences are measurements
There needs to be a way to associate quality measures to each base
FASTQ -> .fq, .fastq (FASTA with qualities)

Single end sequencing
[Figure labels: sequencing direction, sequencing reads]

The structure of the FASTQ file
Four lines per FASTQ record:
1. @ indicates the sequence identifier
2. The sequence content of the read
3. + optionally repeats the sequence id (often left empty)
4. Sequence quality string

Encodings
An encoding is a transformation from one representation to another
The information is not changed
The optimization method changes
e.g. pig latin is a type of encoding

Paper: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010) 38 (6): 1767-1771.

Encoding
Ordinal (numerical) value of a character (ord)
One character -> one byte of space
ABCa = 4 bytes long
65 66 67 97 = 11 bytes long
Good: three characters are turned into one, saves space
Bad: not readable, hinders understanding
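The ordinal-value example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names are made up for this note, and the space-separated number string is an assumption about how the "11 bytes" representation on the slide was written.

```python
# Compare the one-byte-per-character form with its spelled-out ordinal values.
text = "ABCa"

# ord() returns the numerical (ordinal) value of a single character.
codes = [ord(ch) for ch in text]

# Spell the same information out as decimal numbers separated by spaces.
as_numbers = " ".join(str(code) for code in codes)

print(codes)            # the ordinal values of A, B, C and a
print(len(text))        # length of the compact form
print(len(as_numbers))  # length of the spelled-out form

# chr() is the inverse of ord(), so the transformation loses no information.
decoded = "".join(chr(int(token)) for token in as_numbers.split())
print(decoded == text)
```

The round trip through `chr` illustrates the point that an encoding changes the representation but not the information.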

Remapping an encoding
Problem: only some types of characters can be printed.
So the encoding must start at a character that can be printed. That character's code won't be zero, yet it needs to represent zero.
Say character 'A' has a code of 65. If we were to choose 'A' as the minimum of our scale then we need to shift the scale by 65.

Quality Scores
A quality score is a number that usually has limits, from a low (say 0) to a high (say 40)
A quality score represents an error probability.
It characterizes a single step of the process and NOT the entire experimental procedure
Quality scores are used to represent base calling accuracy, alignment accuracy and other probabilities

PHRED Quality Scores
Connecting a quality score to a probability
For a quality score Q the error probability is $P = 10^{-Q/10}$
Examples:
Q = 10 -> $P = 10^{-1} = 1/10 = 0.1$ => P = 10%
Q = 40 -> $P = 10^{-4} = 1/10000 = 0.0001$ => P = 0.01%

There are multiple encodings: shifts
Illumina used to switch around the encoding every once in a while.
Finally they settled on the Sanger encoding/PHRED quality representation, since 2011 or so.
There are plenty of data sets/tools out there that may use different encodings!
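The PHRED relation between a quality score Q and an error probability P can be checked numerically. The sketch below is a minimal illustration; the function names `phred_to_prob` and `prob_to_phred` are hypothetical and not taken from the lecture or from any particular library.

```python
import math


def phred_to_prob(q: float) -> float:
    """Error probability for a PHRED quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Inverse relation: Q = -10 * log10(P), defined for 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_prob(q)
    # Show the probability both as a fraction and as a percentage.
    print(f"Q = {q:2d}  P = {p:.4f}  ({p * 100:.2f}%)")
```

Every increase of 10 in Q divides the error probability by 10, which is why a compact integer scale is enough to cover a wide range of probabilities.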

Sanger Encoding (shift by 33)
Quality value range between 0 and 93
Start the scale at character 33
End the scale at character 33 + 93 = 126

Illumina 1.3 encoding (shift by 64)
(obsolete but still often observed in the wild)
Quality range between 0 and 62
Start the scale at character 64
End the scale at character 64 + 62 = 126
(currently most instruments only produce qualities in the range 0 to 40)

FASTQ encoding formats
[Figure: FASTQ encoding formats]

Understanding encodings
If you understand how to read this you'll understand the FASTQ format
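The two shifts described above (33 for Sanger, 64 for Illumina 1.3) amount to adding or subtracting a fixed offset from each character's ordinal value. The following is a minimal sketch under that description; the constant and function names are illustrative assumptions, not part of any standard tool.

```python
SANGER_OFFSET = 33        # Sanger encoding: scale starts at character 33
ILLUMINA_13_OFFSET = 64   # Illumina 1.3 encoding: scale starts at character 64
LAST_CHARACTER = 126      # both scales end at character 126


def decode_qualities(quality_string: str, offset: int = SANGER_OFFSET) -> list[int]:
    """Turn a FASTQ quality string into a list of integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]


def encode_qualities(scores: list[int], offset: int = SANGER_OFFSET) -> str:
    """Turn integer quality scores into a FASTQ quality string."""
    for q in scores:
        if not 0 <= q <= LAST_CHARACTER - offset:
            raise ValueError(f"score {q} is outside the range for offset {offset}")
    return "".join(chr(q + offset) for q in scores)


print(decode_qualities("!+5?~"))                     # [0, 10, 20, 30, 93]
print(encode_qualities([0, 10, 30], offset=ILLUMINA_13_OFFSET))  # @J^
```

Decoding the same string with the wrong offset still returns numbers, just shifted by 31, which is why a file's encoding has to be known before its qualities are interpreted.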

More information may be present
Illumina instrumentation specific information: lane, tile, spot

Illumina FASTQ header format
De-facto standard for producing sequencing reads. The vast majority of current tools expect this format.
Storing data in SRA removes the extra header information in the FASTQ record! That is unfortunate! Some information is now lost and available only to the original authors!
1. Instrument name: HWI-ST1342 (unique for every sequencer)
2. Run id: 96
3. Flowcell id: H0NP9ADXX (unique for every flowcell)
4. Flowcell lane: 2
5. Tile number within the flowcell: 1115
6. X-coordinate of the cluster in the tile: 13393
7. Y-coordinate of the cluster in the tile: 59201
More fields may also be present (not shown above):
1. Mate pair 1 or 2
2. Flag: Y or N
Control bits, index sequences: usually defined in the Illumina manuals

Homework 6
What characters in the Sanger encoding represent base calling error probabilities of: 100, 0.01, 0.001
Create a Sanger encoded FASTQ file that contains a single record with the sequence ATGC and has the qualities of 40, 35, 36 and 32
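Returning to the Illumina header fields listed above, the sketch below shows how the seven values could be pulled out of a header line. It assumes the fields are joined by colons and that any extra fields follow after a space, which is how Illumina headers are commonly written; the example header line is assembled from the values on the slide, the trailing `1:N:0` part is an invented placeholder, and the names `IlluminaHeader` and `parse_header` are hypothetical.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_header(line: str) -> IlluminaHeader:
    """Parse the first, colon-separated part of an Illumina-style FASTQ header."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # Anything after the first space (mate pair, flag, ...) is ignored here.
    first_part = line[1:].split(" ", 1)[0]
    fields = first_part.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, found {len(fields)}")
    instrument, run_id, flowcell_id, lane, tile, x, y = fields
    return IlluminaHeader(
        instrument, int(run_id), flowcell_id, int(lane), int(tile), int(x), int(y)
    )


# Assumed example line built from the field values listed on the slide.
header = parse_header("@HWI-ST1342:96:H0NP9ADXX:2:1115:13393:59201 1:N:0")
print(header.instrument, header.lane, header.tile, header.x, header.y)
```

Headers rewritten by SRA will typically fail the seven-field check, which matches the point above that the instrument-specific information is lost there.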