Exeter Sequencing Service

A guide to your de novo RNA-seq results

An overview

Once your results are ready, you will receive an email with a password-protected link to them. Click the link to access your data (note that this will not work with the Safari browser; please use Firefox if you are on a Mac). Your results directory will look something like this:

In this case the results are for 5 samples (labelled Sample_24, Sample_3d, etc.).

The first file to check is the Summary.html file. This contains important information concerning basic run metrics for your samples. The file should look similar to:

Note that if you have any multiplexed samples, the multiplex barcodes used to identify your sample are listed. However, your samples have already been demultiplexed (no mismatches are permitted within the barcode), so there is no need to demultiplex your samples yourself unless you are using non-standard barcode adaptors.

Total sequence yield is given along with the total number of reads per sample. Note that these numbers are calculated before Illumina chastity filtering and before quality-score trimming, so the number of reads used in the final analysis will be lower. Ideally the mean quality score should be above 30. Remember that this is a Phred-based quality score (i.e. 10 = 1 in 10 chance of the base being incorrect, 20 = 1 in 100 chance of the base being incorrect, 30 = 1 in 1000 chance of the base being incorrect, etc.).

Flowcell IDs and other information useful if submitting your raw sequence data to public databases are also given in the Summary.html file.

Note that samples may sometimes be spiked with PhiX sequence to act as a quality control. Although these reads are filtered out before the sequence files are provided to you, it is worth bearing in mind in case a few reads slip through.

Meta-analyses

1. pfam_comparison/

This directory contains the file pfam_comparison.txt. This is a text file suitable for viewing in a spreadsheet. In the example below, PFAM domain PF00020 (TNFR_c6 - Tumor necrosis factor receptor) is present in Sample N and Sample CM but absent from the other four samples, whereas PF00002 (7tm_2 Secretin receptor) is present in all five samples.
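The Phred scale used for the mean quality score in Summary.html can be sketched in a few lines of Python. This is only an illustration of the standard relationship Q = -10 * log10(p); the function names are hypothetical and are not part of the results pipeline.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Reproduces the rule of thumb from the text: 10, 20 and 30
    # correspond to 1 in 10, 1 in 100 and 1 in 1000.
    for q in (10, 20, 30):
        print(f"Q{q}: 1 in {round(1 / phred_to_error_probability(q))}")
```

A mean quality score above 30 therefore corresponds to fewer than one expected error per 1000 bases.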

Sample directories

For each sample in your project, a series of analyses is performed. The first is a de novo assembly using Velvet and Oases (http://ebi.ac.uk/~zerbino/velvet http://ebi.ac.uk/~zerbino/oases). Velvet performs an initial de novo assembly, which Oases then refines to form transcripts. Once completed, the results are passed to the cuffdiff package to perform differential expression analysis (http://cufflinks.cbcb.umd.edu/manual.html).

denovo_assembly/

This directory contains the results of the de novo assembly using Velvet and Oases.

File              Description
Log               Text file. Contains log information and parameters used for the Velvet assembly.
Unusedreads.fa    FASTA file containing reads not used in the assembly.
contigs.fa        FASTA file containing the assembled contigs from Velvet.
transcripts.fa    FASTA file. This is essentially the final result of the de novo assembly.
stats.txt         Text file suitable for viewing in a spreadsheet. Velvet statistics (note that coverage and length are based on k-mer coverage and length).

Transcripts in transcripts.fa are listed in the format:

    Locus_1_Transcript_1/2_Confidence_1.000_Length_418

where Locus indicates a suspected genomic locus, Transcript_1/2 indicates that this is isoform 1 of 2, and Confidence indicates the fraction of reads within this locus which support this isoform.

1. Log

This is the Log file produced by Velvet. It details which version and parameters were used, along with summary statistics such as N50, number of contigs, maximum contig size, total assembly size and median depth of coverage. E.g. (note that the version number etc. will vary; please check your Log file for details):

    Wed Nov 9 19:11:33 2011
    velvetg remapping_to_reference/unmappedreads_assembly/ -cov_cutoff auto -exp_cov auto -unused_reads yes
    Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk)
    This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    Compilation settings:
    CATEGORIES = 2
    MAXKMERLENGTH = 101
    Median coverage depth = 81.722184
    Final graph has 187 nodes and n50 of 17387, max 103277, total 644417, using 2558521/3274800 reads

denovo_assembly/assembly_summary_statistics

File                 Description
stats.txt            Summary statistics for the assembly
sorted_contigs.fa    FASTA file containing isotigs ordered by size
*.png                Graphical representations of assembly metrics
*.dat                Data used to generate histograms

Of greatest interest to most users will be the stats.txt file, which contains details of various assembly metrics. E.g.:

    Statistics for isotig lengths:
    Min isotig length: 161
    Max isotig length: 75,867
    Mean isotig length: 6988.76
    Standard deviation of isotig length: 7215.64
    Median isotig length: 4,830
    N50 isotig length: 11,691

    Statistics for numbers of isotigs:
    Number of isotigs: 5,296
    Number of isotigs >=1kb: 4,539
    Number of isotigs in N50: 966

    Statistics for bases in the isotigs:
    Number of bases in all isotigs: 37,012,455
    Number of bases in isotigs >=1kb: 36,625,379
    GC Content of isotigs: 50.27 %

Here we can see that this assembly contains 5,296 contiguous sequences of DNA, 4,539 of which are at least 1 kb long. The total size of the assembly (i.e. the number of bases in all isotigs) should tally with the transcriptome size you are expecting. However, beware of possible ploidy effects if your initial estimate seems to be out by a factor of 2 or more.

expression_estimate/

File                                   Description
transcripts_expression_estimate.txt    Text file suitable for viewing in a spreadsheet. Contains coverage information for each transcript.
mapping_reads_to_transcripts.bam       Binary alignment (BAM) file. Contains the details of how the reads map to the contigs. View in IGV with the transcripts.fa file as reference to visualise the mapping of reads to transcripts.
mappingstats.txt                       Text file. Contains summary statistics of the mapping of reads to contigs.
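For readers unsure how the "N50 isotig length" and "Number of isotigs in N50" figures in stats.txt are derived, the following is a minimal sketch using the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly). The guide does not state which definition the pipeline uses, so treat this as an assumption; the function name is hypothetical.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach it).

    Sequences are taken from longest to shortest until their combined
    length reaches at least half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no sequence lengths given")


if __name__ == "__main__":
    # Toy assembly of five sequences totalling 300 bases:
    # 100 + 80 = 180 >= 150, so the N50 is 80 and two sequences reach it.
    print(n50([20, 40, 60, 80, 100]))
```

A high N50 relative to the mean length indicates that most of the assembled bases sit in long sequences.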

Files which are likely to be of particular interest:

1. transcripts_expression_estimate.txt

Contains details of the number of reads mapping to the contigs. E.g.:

    Contig ID                             Number of reads
    NODE_1_length_6194_cov_4305.022949    470559
    NODE_2_length_6789_cov_5382.181641    520777
    NODE_3_length_1653_cov_590.568665     55141
    NODE_4_length_2712_cov_48.6733066     8327

2. mappingstats.txt

    156461298 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    135416978 + 0 mapped (86.55%:nan%)
    156461298 + 0 paired in sequencing
    78230649 + 0 read1
    78230649 + 0 read2
    135416978 + 0 properly paired (86.55%:nan%)
    135416978 + 0 with itself and mate mapped
    0 + 0 singletons (0.00%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapq>=5)

This indicates that 86.55% of reads mapped to the transcripts and that all did so with the expected paired-end insert size (only applicable for paired-end reads).

See the IGV user guide at http://www.broadinstitute.org/software/igv/userguide for details of how to load these files into IGV to visualise your data.
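The mapped percentage in mappingstats.txt can be recomputed from the raw counts if you want to tabulate it across samples. The sketch below assumes the file keeps the "passed + failed label" line layout shown in the example above (which resembles samtools flagstat output, although the guide does not name the tool); the function name is hypothetical.

```python
def mapping_rate(stats_text):
    """Percentage of QC-passed reads that mapped, from mappingstats.txt text."""
    total = mapped = None
    for line in stats_text.splitlines():
        fields = line.split()
        # Expect lines such as "135416978 + 0 mapped (86.55%:nan%)".
        if len(fields) < 4 or fields[1] != "+" or not fields[0].isdigit():
            continue
        qc_passed = int(fields[0])
        label = " ".join(fields[3:])
        if label.startswith("in total"):
            total = qc_passed
        elif label.startswith("mapped"):
            mapped = qc_passed
    if not total or mapped is None:
        raise ValueError("could not find total and mapped read counts")
    return 100.0 * mapped / total


if __name__ == "__main__":
    example = (
        "156461298 + 0 in total (QC-passed reads + QC-failed reads)\n"
        "0 + 0 duplicates\n"
        "135416978 + 0 mapped (86.55%:nan%)\n"
    )
    print(f"{mapping_rate(example):.2f}% of reads mapped")
```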

Annotation/

File                                            Description
transcripts.fa.blastp                           Text file. BlastP results of the transcripts.fa.orf file against the NCBI non-redundant protein database.
transcripts.fa.blastp.table                     Text file suitable for viewing in a spreadsheet. As above, in tabular format.
transcripts.fa.orf                              FASTA file. Contains open reading frames (ORFs) translated into protein space using the appropriate codon usage table. Note that nucleotide start/stop co-ordinates are given in the header. ORFs are reported between any two STOP codons with a distance larger than 302 nucleotides.
transcripts.orf.pfam                            Text file suitable for viewing in a spreadsheet. PFAM results for all the ORFs in the file above.
taxonomy.txt                                    Text file. Taxonomy file generated from the GI numbers in transcripts.fa.blastn.table.
taxonomy_of_unmapped_denovo_transcripts.pdf     PDF file. Contains a visual representation of the number of contigs mapping to a given taxon.

Several files here are likely to be of particular interest to the user:

1. transcripts.fa.orf

This contains the open reading frames called by the EMBOSS program getorf. Note that the header contains the name of the transcript, with the start and end locations of the ORF in the nucleotide file (transcripts.fa) appended. For example, the first entry below has an ORF on the reverse strand between positions 329 and 3. The second has an ORF between positions 213 and 1043 on the forward strand.

    >NODE_2_length_311_cov_1462.295776_1[329-3](REVERSESENSE)
    YGHEWRRMSRQCTHYGRWPQHGFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAK
    SRQRWLFYAYDRIRRTVVAHVFGERTLATLERLLSLLSAFEVVVWMTDG
    >NODE_3_length_1925_cov_70.433769_1[213-1043]
    WWPAMNARVAKLALDARAIRQSIIRTASAAPVDGVHLGPALSMVEIAAALYGAVMRFDPK
    NMASMARDRFLLSKGHAALALYATLHHYGVLSDDELATFDHSGSRFPALTPMNPPLGIDF
    AGGSLGMGVGYACGAALAQRLRGESWRHYIVLGDGECNEGSVWESAFFAAQQGLDQLTAI
    VDCNGFQSDWSCEQTIKMDFPALWAACGWHVETCDGHDIAALLAALDAPSHGKPKAIVAR
    TVKGKGVSFMEHNNAFHRARLSAAQRDAALAELEAHP

2. transcripts.orf.pfam

Contains the PFAM domains called for each ORF in the file above. Search the PFAM database using the ID (e.g. ResIII) for more details.

    # <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>
    NODE_1_length_103317_cov_46.027489_77 131 287 116 288 PF04851.8 ResIII Family 26 183 184 56.9 1.7e-15 1 CL0023
    NODE_1_length_103317_cov_46.027489_77 406 482 405 482 PF00271.24 Helicase_C Family 2 78 78 44.0 1.1e-11 1 CL0023
    NODE_1_length_103317_cov_46.027489_79 22 188 22 188 PF03614.6 Flag1_repress Family 1 165 165 267.7 2e-80 1 No_clan

3. taxonomy_of_unmapped_denovo_transcripts.pdf

This is the file most likely to be of interest here. It contains a graphical representation of the species that transcripts map to. In the example below, red regions correspond to those with the most transcripts mapping. The numbers next to each entry indicate the numbers of transcripts mapping to each location. Some transcripts may map to more than one location.
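If you want to relate ORFs in transcripts.fa.orf back to positions in transcripts.fa, the header can be split into its parts programmatically. The sketch below is based only on the two example headers shown for transcripts.fa.orf; it tolerates optional spaces around the brackets as an assumption, and the function name and returned field names are hypothetical.

```python
import re

# Matches e.g. ">NODE_2_length_311_cov_1462.295776_1[329-3](REVERSESENSE)".
ORF_HEADER = re.compile(
    r"^>?(?P<transcript>\S+)_(?P<orf>\d+)\s*"
    r"\[(?P<start>\d+)\s*-\s*(?P<end>\d+)\]\s*"
    r"(?P<reverse>\(REVERSE\s*SENSE\))?"
)


def parse_orf_header(header):
    """Split an ORF header into transcript name, ORF number, co-ordinates and strand."""
    match = ORF_HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised ORF header: {header!r}")
    return {
        "transcript": match.group("transcript"),
        "orf_number": int(match.group("orf")),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        # Reverse-strand ORFs are flagged in the header; otherwise forward.
        "strand": "-" if match.group("reverse") else "+",
    }


if __name__ == "__main__":
    print(parse_orf_header(">NODE_2_length_311_cov_1462.295776_1[329-3](REVERSESENSE)"))
    print(parse_orf_header(">NODE_3_length_1925_cov_70.433769_1[213-1043]"))
```

The transcript name recovered this way should match the corresponding sequence header in transcripts.fa.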

raw_illumina_reads/

This directory contains the raw sequence data for this sample. The nucleotide_distribution and read_quality files contain the nucleotide distribution and quality scores (see the QC section of this site). Files ending R1_001.fastq contain the raw sequence data for read 1 and those ending R2_001.fastq contain the raw sequence data for read 2. Those ending .filtered contain data filtered using the ea-utils kit.
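When scripting downstream analyses it is often convenient to pair the read 1 and read 2 files automatically. The sketch below relies only on the R1_001.fastq and R2_001.fastq suffixes described above; the example file names and the function name are hypothetical.

```python
def pair_read_files(filenames):
    """Group raw FASTQ files into read 1 / read 2 pairs by shared prefix."""
    suffixes = {"R1": "R1_001.fastq", "R2": "R2_001.fastq"}
    pairs = {}
    for name in filenames:
        for read, suffix in suffixes.items():
            if name.endswith(suffix):
                prefix = name[: -len(suffix)]
                pairs.setdefault(prefix, {})[read] = name
    return pairs


if __name__ == "__main__":
    # Hypothetical directory listing; real file names will differ.
    listing = [
        "Sample_24_R1_001.fastq",
        "Sample_24_R2_001.fastq",
        "read_quality.txt",
    ]
    for prefix, reads in pair_read_files(listing).items():
        print(prefix, reads.get("R1"), reads.get("R2"))
```

Files that do not end in either suffix, such as the quality reports, are simply ignored.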