Sequence Analysis Pipeline

Similar documents
RNA-seq. Manpreet S. Katari

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

RNA-Seq Analysis With the Tuxedo Suite

Reference guided RNA-seq data analysis using BioHPC Lab computers

Galaxy workshop at the Winter School Igor Makunin

Goal: Learn how to use various tool to extract information from RNAseq reads. 4.1 Mapping RNAseq Reads to a Genome Assembly

SNP Calling. Tuesday 4/21/15

Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software:

Tiling Assembly for Annotation-independent Novel Gene Discovery

Maize genome sequence in FASTA format. Gene annotation file in gff format

NGS Analysis Using Galaxy

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Cyverse tutorial 1 Logging in to Cyverse and data management. Open an Internet browser window and navigate to the Cyverse discovery environment:

Short Read Sequencing Analysis Workshop

Goal: Learn how to use various tool to extract information from RNAseq reads.

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub

NGS FASTQ file format

RNASeq2017 Course Salerno, September 27-29, 2017

Galaxy Platform For NGS Data Analyses

Mapping NGS reads for genomics studies

Evaluate NimbleGen SeqCap RNA Target Enrichment Data

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

Single/paired-end RNAseq analysis with Galaxy

TopHat, Cufflinks, Cuffdiff

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

mrna-seq Basic processing Read mapping (shown here, but optional. May due if time allows) Gene expression estimation

The software and data for the RNA-Seq exercise are already available on the USB system

Illumina Next Generation Sequencing Data analysis

Our data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there:

version /1/2011 Source code Linux x86_64 binary Mac OS X x86_64 binary

David Crossman, Ph.D. UAB Heflin Center for Genomic Science. GCC2012 Wednesday, July 25, 2012

RNA-seq Data Analysis

Services Performed. The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples.

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6

Analyzing ChIP- Seq Data in Galaxy

NGS Data Analysis. Roberto Preste

Exercise 1 Review. --outfiltermismatchnmax : max number of mismatch (Default 10) --outreadsunmapped fastx: output unmapped reads

Quality Control of Sequencing Data

How to store and visualize RNA-seq data

replace my_user_id in the commands with your actual user ID

NGS Data Visualization and Exploration Using IGV

Using Galaxy: RNA-seq

Bioinformatics in next generation sequencing projects

KisSplice. Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data. 29th may 2013

Ensembl RNASeq Practical. Overview

TP RNA-seq : Differential expression analysis

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Centre (CNIO). 3rd Melchor Fernández Almagro St , Madrid, Spain. s/n, Universidad de Vigo, Ourense, Spain.

Identiyfing splice junctions from RNA-Seq data

Our typical RNA quantification pipeline

Genomic Files. University of Massachusetts Medical School. October, 2015

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

Differential gene expression analysis using RNA-seq

Rsubread package: high-performance read alignment, quantification and mutation discovery

Genomic Files. University of Massachusetts Medical School. October, 2014

High-throughout sequencing and using short-read aligners. Simon Anders

Rsubread package: high-performance read alignment, quantification and mutation discovery

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

The preseq Manual. Timothy Daley Victoria Helus Andrew Smith. January 17, 2014

Taller práctico sobre uso, manejo y gestión de recursos genómicos de abril de 2013 Assembling long-read Transcriptomics

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)

Calling variants in diploid or multiploid genomes

RNA Sequencing with TopHat and Cufflinks

Read mapping with BWA and BOWTIE

RNA Sequencing with TopHat Alignment v1.0 and Cufflinks Assembly & DE v1.1 App Guide

v0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting

Testing for Differential Expression

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Connect to login8.stampede.tacc.utexas.edu. Sample Datasets

Variant calling using SAMtools

Handling sam and vcf data, quality control

CLC Server. End User USER MANUAL

Circ-Seq User Guide. A comprehensive bioinformatics workflow for circular RNA detection from transcriptome sequencing data

Visualization using CummeRbund 2014 Overview

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Analyzing Variant Call results using EuPathDB Galaxy, Part II

Introduction to Cancer Genomics

Lecture 12. Short read aligners

Exercise 2: Browser-Based Annotation and RNA-Seq Data

Practical Linux Examples

Data Preprocessing. Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

m6aviewer Version Documentation

Handling important NGS data formats in UNIX Prac8cal training course NGS Workshop in Nove Hrady 2014

Data Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis

RNA- SeQC Documentation

Quantification. Part I, using Excel

Variation among genomes

Package Rsubread. July 21, 2013

Welcome to GenomeView 101!

preparation methods and new bacterial strains. Parts of the pipeline that can be updated will be annotated in this guide.

ChIP-seq hands-on practical using Galaxy

From the Schnable Lab:

Differential gene expression analysis

Genomics. Nolan C. Kane

Merge Conflicts p. 92 More GitHub Workflows: Forking and Pull Requests p. 97 Using Git to Make Life Easier: Working with Past Commits p.

A review of RNA-Seq normalization methods

Introduction to UNIX command-line II

Transcription:

Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation Discover biological significance Rebuild transcriptome

Assembly De novo Vs Reference guided Reference-guided - map to previously-assembled genome or transcriptome pros: computationally easier cons: must have good reference assembly to a related species, may miss some transcripts De novo - cluster reads into transcripts pros: no prior assemblies/annotations necessary cons: more difficult, more memory needed

Assembly De novo Vs Reference guided Reference-guided today De novo

Exercise 1 Two RNA-seq datasets used in the tomato genome project were downloaded from the SRA in.sra format and extracted using the SRA toolkit (NCBI). http://www.ncbi.nlm.nih.gov/sra They were already cleaned using fastq-mcf. All data is from S. pimpinellifolium. Datasets : - breaker fruit (two files) - immature fruit (two files) In this exercise, we will map the reads to tomato chromosome 4 using reference-guided assembly

Exercise 1 Run shell script run_tuxedo.sh (to be explained) 1. Find the file test_rnaseq.tar.gz in your ~/Data directory and unzip $cd ~/Data $tar -xvf test_rnaseq.tar.gz 2. Copy shell script run_tuxedo.sh to ~/Data/test_RNAseq/ and make it executable $chmod 755 ~/Data/test_RNAseq/run_tuxedo.sh

Exercise 1 Run shell script from ~/Data/test_RNAseq/ $cd ~/Data/test_RNAseq/ $./run_tuxedo.sh This will run ~15 minutes. We ll discuss what it is doing while it runs.

run_tuxedo.sh shell script that runs 6 linux commands (each can be typed in a terminal) STEP 1: uses bowtie2-build to index the tomato reference file. ~2 minutes $bowtie2-build bwt2_index/ SL2.40ch04.fa bwt2_index/sl2.40ch04

run_tuxedo.sh STEPS 2-5: use tophat2 to map each fastq file in test_rnaseq/ fastq to the reference. ~3 minutes each step $tophat2 -o fastq/breaker/srr404334_ch4_thout/ --no-novel-juncs \ --no-coverage-search bwt2_index/sl2.40ch04 fastq/breaker/srr404334_ch4.fastq STEP 6: runs cufflinks, but we will discuss this part later

Reference-guided assembly http://tophat.cbcb.umd.edu/ deals with splice junctions Illumina RNA-Seq reads (paired or single-end)

Reference-guided assembly Tophat output bam file of mapped reads (bam = compressed.sam file) bed files: junctions, insertions, deletions * Look into each output directory (breaker, immature_fruit) and convert accepted_hits.bam to accepted_hits.sam using samtools $samtools view -ho accepted_hits.sam accepted_hits.bam

Reference-guided assembly

Reference-guided assembly

Excellent explanation for bitwise flags http://seqanswers.com/forums/showthread.php?t=2301 Tool to translate meaning of bitwise flag: http://picard.sourceforge.net/explain-flags.html

Cigar string SAM format

Manipulating SAM files 1.Convert accepted_hits.bam to accepted_hits.sam $samtools view -ho accepted_hits.sam accepted_hits.bam 2.How many reads are in the tophat2 output file? $grep -v ^@ accepted_hits.sam wc

Manipulating SAM files 3. How many chromosomes are in the reference file? $grep ^@SQ accepted_hits.sam wc 4. How many reads map to each chromosome? $grep -v ^@ accepted_hits.sam cut -f3 sort uniq -c

Assembly - viewing output Tablet http://bioinf.scri.ac.uk/tablet/ On your machine: $ cd ~/Tablet $./tablet Indexing files for Tablet: 1. reference sequence samtools faidx SL2.40ch04.fa 2. bam file samtools index

Assembly - viewing output Load into Tablet: 1. reference.fa file 2. mapped reads.bam file 3. annotations.gff3 file

Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation Discover biological significance Rebuild transcriptome

Analysis - stats How good is the assembly? Check for contaminants by using BLAST to search appropriate databases (SeqClean) How many contigs are there vs how many genes expected - total sequence length? Length of contigs - compare to known long transcripts? How many reads map (reference-guided)?

Analysis - information SNPs Expression RNA-Seq Annotations

Analysis - expression

run_tuxedo.sh Exercise 2 STEP 6: Use cuffdiff to detect differentially expressed genes in immature and breake fruit using known chromosome 4 tomato gene models #Step 6: run cufflinks to find putatively differential expressed genes $cuffdiff -o cuffdiff_out -b bwt2_index/sl2.40ch04.fa -u annotation/itag2.3_gene_models_ch4.gtf fastq/breaker/srr404334_ch4_thout/accepted_hits.bam,fastq/ breaker/srr404336_ch4_thout/accepted_hits.bam fastq/immature_fruit/srr404331_ch4_thout/ accepted_hits.bam,fastq/immature_fruit/srr404333_ch4_thout/ accepted_hits.bam ~5 minutes

Analysis - expression Cufflinks can also be used for strand-specific RNA-seq novel transcript discovery in annotated genomes identification of novel splice variants detecting transcripts in genomes without annotation For protocols see: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks (Trapnell et al., 2012).

Analysis - expression Reads Per Kilobase of exon model per Million mapped reads [Mortazavi et al., 2008] RPKM = Total exon reads mapped reads (millions) x exon length (KB) *Cufflinks uses the analogous FPKM

Analysis - expression Challenges paralogs/reads that map to more than one place Cufflinks divides these evenly across the matches or tries to assign them a location based on expression of surrounding reads.

Cufflinks output Extract genes that have a significant value for differential expression $awk ' $14 = "yes" {print $0} ' gene_exp.diff > significant_genes.txt

Other tools Bedtools - used to compare features SNPeff - gives predicted effect of SNP (synonymous, nonsynonymous, etc. SRA toolkit - for converting NGS data files from the Sequence Read Archive Picard - duplicate detection, etc. Prinseq - duplicate removal Dindel - indel calling from NGS mapping data R Packages: Bioconductor, DESeq, edger seqanswers.com