Read Naming Format Specification


Karel Břinda, Valentina Boeva, Gregory Kucherov

Version 0.1.3 (4 August 2015)

Abstract

This document provides a standard for naming simulated Next-Generation Sequencing (NGS) reads in order to make read simulators and mapper evaluation tools inter-compatible.

[Figure: RNF workflow. Read simulation: a read simulator takes genomes 1 to n (FASTA) and produces reads (FASTQ) whose names are RNF-encoded. Mapper evaluation: a mapper aligns the reads (BAM), and an evaluation tool decodes the RNF names and produces a report (TXT/HTML).]

1 Terminologies and concepts

Read tuple. A tuple of sequences (possibly overlapping) obtained from a sequencing machine from a single fragment of DNA.

Reads. Members of a read tuple. For example, every paired-end read is a read tuple and both of its ends are individual reads in our notation.

Segments. Substrings of a read which are spatially distinct in the reference. They correspond to individual lines in a SAM file [1]. Thus, each read has an associated chain of segments, and we associate a read tuple with the segments of all its reads.

Remarks:
- A single-end read consists of a single read with a single segment, unless it comes from a region with a genomic rearrangement.
- A paired-end read or a mate-pair read consists of two reads, each with one segment (under the same condition).
- A strobe read consists of several reads.
- A chimeric read (i.e., a read corresponding to a genomic fusion, a long deletion, or a translocation) has at least two segments.
- RNA-seq spliced reads are not considered to be spatially distinct.

Simulator of NGS reads. A program which creates artificial (simulated) reads from one or more (possibly random) reference genomes.

Evaluation tool of NGS mappers. A program which evaluates alignments of simulated NGS reads with known original genomic positions. It assesses whether each individual read is aligned correctly and usually also produces overall statistics.

1-based coordinate system. A coordinate system where the first position has number 1 and intervals are closed (the same system is used by the SAM format [1]).
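As a small illustration of the 1-based closed convention, the following Python sketch converts such an interval to the 0-based half-open form used by Python slicing. The function names are hypothetical and not part of RNF; the value zero, which RNF reserves for "not available", is deliberately rejected here.

```python
def interval_length(left: int, right: int) -> int:
    """Number of positions in the 1-based closed interval [left, right]."""
    if left < 1 or right < left:
        raise ValueError("expected 1 <= left <= right")
    return right - left + 1


def to_zero_based_half_open(left: int, right: int) -> tuple:
    """Convert 1-based closed [left, right] to 0-based half-open [start, end)."""
    interval_length(left, right)  # reuse the range check
    return left - 1, right


# Positions 1..10 of a sequence correspond to the Python slice [0:10].
start, end = to_zero_based_half_open(1, 10)
print("ATGTTAGATAAGATAGC"[start:end])
```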

Figure 1: Example of simulated read tuples and their corresponding RNF names, which can be used as read names in the final FASTQ files.

(a) Simulated reads

[Panel (a): alignment diagram of the simulated reads against the reference sequences. Arrows give the direction of each read.]

Source 1 - reference genome
chr 1   ATGTTAGATAA-GATAGCTGTGCTAGTAGGCAGTCAGCCC
chr 2   ttcttctggaa-gaccttctcctcctgcaaataaa
Source 2 - generator of random sequences

Reads:
r001     ATG-TAGATA ->
r002/1   TTAGATAACGA ->
r002/2   <- TCAG-CGGG
r003/1   tgcaaataa ->
r003/2   gaa-gacc-t ->
r004     ATAGCT...TCAG ->
r005     GTAGG ->, <- agacctt, <- TCGACACG
r006     ATATCACATCATTAGACACTA

(b) Long and short read names

r. tuple   LRN                                                                   SRN
r001       sim__1__(1,1,F,01,10)__[single-end]                                   #1
r002       sim__2__(1,1,F,04,14),(1,1,R,31,39)__[paired-end]                     #2
r003       sim__3__(1,2,F,09,17),(1,2,F,25,33)__[mate-pair]                      #3
r004       sim__4__(1,1,F,15,36)__[spliced],C:[6=12N4=]                          #4
r005       sim__5__(1,1,R,15,22),(1,1,F,25,29),(1,2,R,05,11)__[chimeric]         #5
r006       rnd__6__(2,0,N,00,00)__[random-contamination]                         #6

The read tuples shown are:
- single-end read (r001);
- paired-end read (r002);
- mate-pair read (r003);
- spliced RNA-seq read (r004);
- chimeric read (r005);
- random contaminating read with unspecified coordinates (r006).

2 Read tuple names

To every read tuple, two names are assigned: a short read name (SRN) and a long read name (LRN). The SRN contains a unique hexadecimal read tuple ID prefixed by #. The LRN consists of four parts delimited by double underscores (__):

i) a prefix (possibly containing expressive information for a user, or a particular string for sorting or randomizing the order of tuples),
ii) the read tuple ID,
iii) information about the origins of all segments that constitute the reads of the tuple,
iv) a suffix containing arbitrary comments or extensions (for holding additional information).

Preferred final read names are LRNs. If an LRN exceeds 255 characters (the maximum read name length allowed in SAM [1]), SRNs are used instead and an SRN-LRN correspondence file must be created.

2.1 Read tuple ID

It is a positive integer that is unique within a single file with genomic data. IDs are assigned consecutively, starting from 1. Zero is reserved for "not available".

2.2 SRN - Short Read Name

\#([0-9a-f]+)

An SRN consists of the read tuple ID prefixed by #. The ID is displayed as a lowercase hexadecimal number, zero-padded so that all SRNs within a single file share the same string length.

2.3 LRN - Long Read Name

^([!-?A-^`-~]*)__([0-9a-f]+)__([!-?A-^`-~]+)__([!-?A-^`-~]*)$

An LRN consists of four double-underscore-delimited parts: i) prefix part, ii) read tuple ID, iii) segmental part, iv) suffix part.

Prefix part

^[!-?A-^`-~]*$

It can be an empty string, a string containing expressive visual information for the user (e.g., for easily distinguishing random reads from the others), or a string used for randomization of read tuples (a randomly taken prefix, with read tuples sorted in lexicographical order). All prefix parts within a single file must have the same length.

Read tuple ID part

^[0-9a-f]+$

It displays the read tuple ID as a lowercase hexadecimal number. All read tuple ID parts are zero-padded so that they share the same string length within a single file.

Segmental part

^(?:(\([0-9FRN,]*\))(?:,(?!$)|$))+$

The segmental part consists of one or more comma-delimited segments.

Segment

^\(([0-9]+),([0-9]+),([FRN]),([0-9]+),([0-9]+)\)$

Every segment is parenthesized and consists of five comma-delimited values: i) genome ID, ii) chromosome ID, iii) direction, iv) leftmost coordinate, and v) rightmost coordinate.

Genome ID. ID (positive integer) of the source genome (a randomly generated genome, a genome saved in a FASTA file, etc.) or zero for "not available". All genome IDs are displayed as decimals and are zero-padded so that they share the same string length within a single file.

Chromosome ID. ID (positive integer) of the source chromosome or zero for "not available". All chromosome IDs are displayed as decimals and are zero-padded so that they share the same string length within a single file. IDs are assigned consecutively, starting from 1, and the order of chromosomes is the same as in the file where the genome is saved. In the case of a random genome, zero should be used.

Direction. Direction in the reference genome: F = forward direction, R = reverse direction, N = not available. For random reads, N should be used.

Leftmost coordinate. The leftmost coordinate of the segment in the reference, in the 1-based coordinate system, or zero for "not available".

Rightmost coordinate. The rightmost coordinate of the segment in the reference, in the 1-based coordinate system, or zero for "not available".

Suffix part

^(?:((?:[A-Za-z0-9]+:){0,1})\[([!-?A-Z\\^`-~]*)\](?:,(?!$)|$))+$

It contains an arbitrary number of comma-delimited comments and extensions in any order.

Comment

^\[([!-?A-Z\\^`-~]*)\]$

Comments are displayed as square-bracketed strings. They can contain, e.g., information about the simulated technology or the program used for simulation.

Extension

^([A-Za-z0-9]+):\[([!-?A-Z\\^`-~]*)\]$

An extension consists of an extension's code, a colon, and the square-bracketed extension's content. Extensions can supplement the basic set of information provided in the segmental part. Some of them are part of this standard; see Section 4.
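The naming rules of Section 2 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `make_srn`, `parse_lrn`, and `Segment` are hypothetical, the patterns are the ones given above with the double-underscore delimiters written out, and the suffix is split with a simplified per-item pattern rather than the full suffix expression.

```python
import re
from typing import NamedTuple

# Patterns adapted from Section 2.3 (assumed reading of the specification).
LRN_RE = re.compile(r"^([!-?A-^`-~]*)__([0-9a-f]+)__([!-?A-^`-~]+)__([!-?A-^`-~]*)$")
SEGMENT_RE = re.compile(r"\(([0-9]+),([0-9]+),([FRN]),([0-9]+),([0-9]+)\)")
SUFFIX_ITEM_RE = re.compile(r"(?:([A-Za-z0-9]+):)?\[([!-?A-Z\\^`-~]*)\]")


class Segment(NamedTuple):
    genome_id: int
    chromosome_id: int
    direction: str  # "F", "R" or "N"
    left: int       # 1-based, 0 = not available
    right: int      # 1-based, 0 = not available


def make_srn(tuple_id, width):
    """Short read name: '#' plus the zero-padded lowercase hexadecimal ID."""
    return "#" + format(tuple_id, "x").zfill(width)


def _split_items(pattern, text, what):
    """Split a comma-delimited part into regex matches, rejecting leftovers."""
    matches = list(pattern.finditer(text))
    if ",".join(m.group(0) for m in matches) != text:
        raise ValueError(f"malformed {what}: {text!r}")
    return matches


def parse_lrn(lrn):
    """Decode an LRN into (prefix, tuple ID, segments, comments, extensions)."""
    match = LRN_RE.match(lrn)
    if match is None:
        raise ValueError(f"not a valid LRN: {lrn!r}")
    prefix, hex_id, segmental, suffix = match.groups()
    segments = [
        Segment(int(g), int(c), d, int(left), int(right))
        for g, c, d, left, right in
        (m.groups() for m in _split_items(SEGMENT_RE, segmental, "segmental part"))
    ]
    comments, extensions = [], {}
    for m in _split_items(SUFFIX_ITEM_RE, suffix, "suffix part"):
        code, content = m.groups()
        if code is None:
            comments.append(content)      # plain comment: [text]
        else:
            extensions[code] = content    # extension: CODE:[content]
    return prefix, int(hex_id, 16), segments, comments, extensions


print(parse_lrn("sim__2__(1,1,F,04,14),(1,1,R,31,39)__[paired-end]"))
```

Matching each item with a regular expression instead of splitting the suffix on commas matters because extension contents may themselves contain commas.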
3 SRN-LRN correspondence file

To encode the correspondence between SRNs and LRNs, a special file is created. Its file name is formed from the prefix of the FASTQ file(s) and the .sl suffix. Examples:

Read files                      SRN-LRN correspondence file
reads_se.fq                     reads_se.sl
reads_se.fastq                  reads_se.sl
reads_pe.1.fq, reads_pe.2.fq    reads_pe.sl
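One way to derive the correspondence file name is sketched below in Python. The specification only says "prefix of the FASTQ file(s)"; stripping a `.fq` or `.fastq` extension and then an optional `.1` or `.2` mate index is an assumption inferred from the three examples above, and the function name is hypothetical.

```python
import re


def correspondence_file_name(fastq_name: str) -> str:
    """Derive the .sl file name from a FASTQ file name (assumed convention)."""
    # Drop the FASTQ extension, e.g. "reads_pe.1.fq" -> "reads_pe.1".
    stem = re.sub(r"\.(?:fq|fastq)$", "", fastq_name)
    # Drop an optional mate index, e.g. "reads_pe.1" -> "reads_pe".
    stem = re.sub(r"\.[12]$", "", stem)
    return stem + ".sl"


for name in ["reads_se.fq", "reads_se.fastq", "reads_pe.1.fq", "reads_pe.2.fq"]:
    print(name, "->", correspondence_file_name(name))
```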

It is a tab-delimited file with two columns (containing the SRN and the corresponding LRN). The file is sorted by read tuple ID.

4 Extensions

Extensions can supplement the basic set of information provided in the segmental part (Section 2.3).

C - CIGAR strings

Extension's code

C

Extension's content

^(?:([0-9]+[=XIDNSHPM]+)(?:,(?!$)|$))+$

Specification

The extension can be used to encode edit operations using CIGAR (Compact Idiosyncratic Gapped Alignment Report) strings. Supported operations:

letter   operation           comment
=        match
X        mismatch
I        insertion
D        deletion
N        skipping bases      skipping intron regions in spliced mapping
S        soft clipping       for cutting unaligned prefixes and suffixes
H        hard clipping       for cutting unaligned prefixes and suffixes
P        padding             unused padding in padded reference
M        match or mismatch   deprecated, reserved for situations when distinguishing X vs. = is impossible

CIGAR strings should be provided in the same order as their corresponding segments in the segmental part (Section 2.3). Adjacent edit operations should be different.

Example

demonstration__004__(1,1,F,16,40),(1,1,R,140,150)__C:[6=14N5=,11=],[spliced-paired-end-read]

References

[1] Li, H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079.
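As a supplementary sketch for the C extension of Section 4, the Python code below parses the comma-delimited CIGAR strings and checks them against segment coordinates. Two points are assumptions rather than requirements stated in this specification: the per-operation pattern used for parsing, and the consistency check that the reference-consuming operations (=, X, D, N, M, as in SAM) add up to the length of the closed interval [leftmost, rightmost]. All function names are hypothetical.

```python
import re

CIGAR_OP_RE = re.compile(r"([0-9]+)([=XIDNSHPM])")
REFERENCE_CONSUMING = set("=XDNM")  # operations that advance along the reference


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs, rejecting malformed input."""
    ops = CIGAR_OP_RE.findall(cigar)
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in ops]


def reference_span(cigar):
    """Number of reference positions covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REFERENCE_CONSUMING)


def adjacent_ops_differ(cigar):
    """True if no two neighbouring operations use the same letter."""
    letters = [op for _, op in parse_cigar(cigar)]
    return all(a != b for a, b in zip(letters, letters[1:]))


def cigars_match_segments(content, spans):
    """Compare an extension's content with (leftmost, rightmost) pairs, in order."""
    cigars = content.split(",")
    return len(cigars) == len(spans) and all(
        reference_span(cigar) == right - left + 1
        for cigar, (left, right) in zip(cigars, spans)
    )


# The example above: segments (16,40) and (140,150) with C:[6=14N5=,11=].
print(cigars_match_segments("6=14N5=,11=", [(16, 40), (140, 150)]))
```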