Identifying splice junctions from RNA-Seq data

Joseph K. Pickrell
pickrell@uchicago.edu
October 4, 2010

Contents

1 Motivation
2 Identification of potential junction-spanning reads
3 Calling splice junctions from mapped reads
4 Combining reads and calculating the FDR
5 A complete example
6 Using jfinder on its own

1 Motivation

In an RNA-Seq experiment, it is often of interest to identify transcript isoforms de novo, without reference to known genome annotations. In this document, I describe the usage of our scripts and software for an important part of this problem: the identification of sequencing reads which span exon-exon junctions. Our goal was to develop a procedure that is flexible enough to identify a large fraction of splice junctions and that also quantifies our confidence in the identified junctions. The software is all available at http://eqtl.uchicago.edu/rna_seq_data/Software/.

We assume that, as an initial step, all the reads have been mapped to the genome. Our procedure can roughly be divided into three steps:

1. Identification of potential junction-spanning reads
2. Calling precise splice junctions from mapped reads
3. Combining reads and assessment of a false discovery rate (FDR)

2 Identification of potential junction-spanning reads

First, we find all the reads which have not mapped to the genome, split each read in two, and map each end independently to the genome. We rely heavily on existing tools like bwa. One script which may be useful is sam_unmapped2fq_trim.py. This script takes a .sam.gz file as input and outputs a .fastq.gz file generated by keeping N bases from one end of each unmapped read.

    USAGE: sam_unmapped2fq_trim.py [input.sam.gz] [output.fastq.gz] [f or l for "first" and "last"] [N]

For example,

    sam_unmapped2fq_trim.py testin.sam.gz testout.fastq.gz f 20

will output the first 20 bases of unmapped reads in testin.sam.gz in fastq format. The .fastq.gz files can then be used as input to bwa or any other mapping tool.

3 Calling splice junctions from mapped reads

We provide a tool, jfinder, for identifying the precise splice junctions supported by a read after performing the above steps. First, we filter out reads where the different ends of the read come from different chromosomes or different strands, or map too far apart.
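The filter just described can be sketched as a small predicate. This is a minimal sketch, not the code of the filtering script itself: the thresholds (mapping quality of at least 10, ends within 100kb of each other) are the ones this section states for the filter, while the EndHit record and the keep_pair function are hypothetical names introduced here for illustration.

```python
from typing import NamedTuple, Optional

# Thresholds taken from the description of the filter in this section.
MIN_MAPQ = 10        # at least one end must map with this quality
MAX_SPAN = 100_000   # both ends must lie within 100kb of each other


class EndHit(NamedTuple):
    """Hypothetical record for one mapped end of a split read."""
    chrom: str
    strand: str
    pos: int
    mapq: int


def keep_pair(first: Optional[EndHit], last: Optional[EndHit]) -> bool:
    """Return True if a split read survives the filter.

    An end that did not map is passed as None.
    """
    hits = [h for h in (first, last) if h is not None]
    # At least one end must map confidently.
    if not any(h.mapq >= MIN_MAPQ for h in hits):
        return False
    # If both ends map, they must agree on chromosome and strand
    # and must not be too far apart.
    if first is not None and last is not None:
        if first.chrom != last.chrom or first.strand != last.strand:
            return False
        if abs(first.pos - last.pos) > MAX_SPAN:
            return False
    return True
```

A read with only one confidently mapped end is kept, because jfinder can still extend that end across a junction.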
To perform this filtering, use filter_pair_sequences_sam.py.

    USAGE: filter_pair_sequences_sam.py [first.sam.gz] [last.sam.gz] [notsplit.sam.gz] [output.gz]

This takes as input the output from step one, as well as the original data file, and outputs the reads where at least one end of the read maps with a quality score of at least 10 and, if both ends map, they map to the same strand of the same chromosome and within 100kb of each other.

Now we can use jfinder on this output.

    USAGE: jfinder -min [minimum intron length] -max [maximum intron length] -l [length of each end] -i [input file] -o [output file] -c [chromosome name] -cf [chromosome file (.fa.gz)]

This must be done on each chromosome separately. This program may be of interest on its own outside of the pipeline described in this document; the input and output files are described in section 6.

4 Combining reads and calculating the FDR

We now have, for each read, the positions of the splice junctions compatible with it. The next step is to combine these reads into a list of all the splice junctions in the data. The script we use for this is read2junc.py.

    USAGE: read2junc.py [input file (.gz)] [output file]

We can now calculate the FDR of the junctions, using count_splice_sites.py.

    USAGE: count_splice_sites.py [input file] [output file]

Printed to stdout is the number of GT-AG or GC-AG junctions, along with the FDR. The output file contains the following fields:

1. the positions of the first splice site consistent with the reads
2. the positions of the second splice site consistent with the reads
3. the number of reads spanning the junction
4. the splice site dinucleotides corresponding to each of the first splice sites
5. the splice site dinucleotides corresponding to each of the second splice sites
6. is the splice site consistent with being a GT-AG or GC-AG junction? (0: no, 1: GC-AG, 2: GT-AG)
7. is the splice site consistent with the control dinucleotides (GT-TC or GC-TC)? (0: no, 1: GC-TC, 2: GT-TC)
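The coding of fields 6 and 7 can be illustrated with a short sketch. Several things here are assumptions rather than documented behaviour: that the dinucleotide lists in fields 4 and 5 are comma-separated and paired by index, that the higher code wins when several candidate pairs match, that only the forward-strand motifs are checked, and that the FDR is estimated as the ratio of control-motif junctions to canonical junctions. The function names classify and control_ratio are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Codes as listed for fields 6 and 7 of the output file.
CANONICAL: Dict[Tuple[str, str], int] = {("GC", "AG"): 1, ("GT", "AG"): 2}
CONTROL: Dict[Tuple[str, str], int] = {("GC", "TC"): 1, ("GT", "TC"): 2}


def _items(field: str) -> List[str]:
    """Split a comma-separated field, ignoring a trailing comma."""
    return [x for x in field.split(",") if x]


def classify(first_dinucs: str, second_dinucs: str,
             table: Dict[Tuple[str, str], int]) -> int:
    """Return the code of the best-matching dinucleotide pair, or 0.

    Assumption: the i-th first-site dinucleotide pairs with the i-th
    second-site dinucleotide, and the highest code is reported.
    """
    pairs = zip(_items(first_dinucs), _items(second_dinucs))
    return max((table.get(pair, 0) for pair in pairs), default=0)


def control_ratio(junctions: Iterable[Tuple[str, str]]) -> Tuple[int, int, float]:
    """Count canonical and control junctions and return their ratio.

    Assumption: control / canonical is used as a rough FDR estimate;
    the exact formula is not stated in this document.
    """
    junctions = list(junctions)
    n_canonical = sum(1 for f, s in junctions if classify(f, s, CANONICAL) > 0)
    n_control = sum(1 for f, s in junctions if classify(f, s, CONTROL) > 0)
    ratio = n_control / n_canonical if n_canonical else float("nan")
    return n_canonical, n_control, ratio
```

The idea behind a control motif is that a junction caller making random errors should land on GT-TC about as often as on GT-AG, so the count of control junctions gives a rough measure of how many canonical calls may be spurious.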

5 A complete example

Imagine we have mapped a lane of reads to the genome, and have the output in testlane.sam.gz. Now, how do we identify all the splice junctions on chromosome 1 supported by these reads? Below are all the commands in order.

    sam_unmapped2fq_trim.py testlane.sam.gz testlane_trim1.fastq.gz f 20
    sam_unmapped2fq_trim.py testlane.sam.gz testlane_trim2.fastq.gz l 20

    ...run bwa on these outputs, gzip the .sam output...

    filter_pair_sequences_sam.py testlane_trim1.sam.gz testlane_trim2.sam.gz testlane.sam.gz testlane.filtered.paired.gz
    jfinder -l 20 -min 30 -max 100000 -i testlane.filtered.paired.gz -o testlane.chr1.junctionreads.gz -c chr1 -cf chr1.fa.gz
    zcat testlane.chr1.junctionreads.gz | awk '{if ($5 > 9 && $7 > 9 && $13 < 3) print $0}' | gzip - > testlane.chr1.filtered.junctionreads.gz

(this filters out alignments with more than 2 mismatches and with fewer than 10 bases on either side of the splice junction)

    read2junc.py testlane.chr1.filtered.junctionreads.gz chr1.juncs
    count_splice_sites.py chr1.juncs chr1.juncs.wss

6 Using jfinder on its own

Once the two ends of a sequencing read have been mapped separately, jfinder can be used to find the splice junctions consistent with each read. As described above, usage is as follows:

    USAGE: jfinder -min [minimum intron length] -max [maximum intron length] -l [length of each end] -i [input file] -o [output file] -c [chromosome name] -cf [chromosome file (.fa.gz)]

The input file is in the following format. On each line, separated by whitespace, are the following columns:

1. read name
2. sequence of read
3. strand
4. chromosome
5. position of first part of read (or NA)
6. position of second part of read (or NA)

For example:

    HWI-EAS134:6:1:0:1724#0 CTTACTCACCCCAGCATGGAAACTACCACGAGGAG + chr8 NA 145137857
    HWI-EAS134:6:1:0:1633#0 TGCACCGGTGCAGCCTCCCATGTCGCAGGCGGAGG + chrx NA 1497876

The output file contains the following columns (one line for each read where a junction was found):

1. read name
2. chromosome
3. sequence of read
4. start of alignment
5. length of first part of alignment after extension (note that for reads where both ends map, this will be the length of the aligned fragment)
6. end of alignment
7. length of second part of alignment after extension (note that for reads where both ends map, this will be the length of the aligned fragment)
8. possible positions of the first splice site (the first base of the intron), comma-separated
9. possible positions of the second splice site (the first base of the exon), comma-separated
10. corresponding intronic dinucleotides for each possible first splice site, comma-separated
11. corresponding intronic dinucleotides for each possible second splice site, comma-separated
12. number of possible splice sites
13. number of mismatches to the genome
14. did both ends of the original read map to the genome? (both = yes, one = no)

For example, one such line might look like this:

    HWI-EAS134:6:1:245:1272#0 chr21 CCGACGTGCACCTTGATGAAGTAGTTTGTCCCCGC 44018629 9 44018991 26 44018638,44018639,44018640,44018641,44018642, 44018965,44018966,44018967,44018968,44018969, AC,CC,CT,TG,GG, CT,TA,AC,CC,CT, 5 0 one
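The 14-column output format can be read with a short parser. The following is a minimal sketch, assuming whitespace-separated columns and comma-separated lists with a trailing comma, as in the example line; the names JunctionRead, parse_junction_read and passes_filter are hypothetical and are not part of jfinder. The passes_filter function mirrors the awk condition used in the complete example (columns 5 and 7 greater than 9, column 13 less than 3).

```python
from typing import List, NamedTuple


class JunctionRead(NamedTuple):
    """One line of jfinder output, in the column order listed above."""
    name: str
    chrom: str
    seq: str
    start: int
    len_first: int
    end: int
    len_second: int
    first_sites: List[int]
    second_sites: List[int]
    first_dinucs: List[str]
    second_dinucs: List[str]
    n_sites: int
    mismatches: int
    both_ends: bool


def _items(field: str) -> List[str]:
    """Split a comma-separated field, ignoring the trailing comma."""
    return [x for x in field.split(",") if x]


def parse_junction_read(line: str) -> JunctionRead:
    f = line.split()
    if len(f) != 14:
        raise ValueError(f"expected 14 columns, got {len(f)}")
    return JunctionRead(
        f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]), int(f[6]),
        [int(x) for x in _items(f[7])], [int(x) for x in _items(f[8])],
        _items(f[9]), _items(f[10]), int(f[11]), int(f[12]),
        f[13] == "both",
    )


def passes_filter(r: JunctionRead, min_flank: int = 10,
                  max_mismatches: int = 2) -> bool:
    """Same condition as the awk filter: $5 > 9 && $7 > 9 && $13 < 3."""
    return (r.len_first >= min_flank and r.len_second >= min_flank
            and r.mismatches <= max_mismatches)


EXAMPLE = (
    "HWI-EAS134:6:1:245:1272#0 chr21 CCGACGTGCACCTTGATGAAGTAGTTTGTCCCCGC "
    "44018629 9 44018991 26 "
    "44018638,44018639,44018640,44018641,44018642, "
    "44018965,44018966,44018967,44018968,44018969, "
    "AC,CC,CT,TG,GG, CT,TA,AC,CC,CT, 5 0 one"
)
read = parse_junction_read(EXAMPLE)
```

For the example line, the first part of the alignment is only 9 bases long, so passes_filter(read) is False and the read would be dropped by the awk step in the complete example.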