Exeter Sequencing Service

A guide to your denovo RNA-seq results

An overview

Once your results are ready, you will receive an email with a password-protected link to them. Click the link to access your data (note this will not work with the Safari browser - please use Firefox if you are on a Mac). Your results directory will look something like this:

In this case the results are for 5 samples (labelled Sample_24 and Sample_3d etc).

The first file to check is the Summary.html file. This contains important information concerning basic run metrics for your samples. The file should look similar to:

Note that if you have any multiplexed samples, the multiplex barcodes used to identify your sample are listed - however your samples have already been demultiplexed (no mismatches are permitted within the barcode). Therefore there is no need to demultiplex your samples yourself, unless you are using non-standard barcode adaptors.
Total sequence yield is given along with the total number of reads per sample. Note that these numbers are taken before Illumina chastity filtering and before quality score trimming, so the number of reads used in the final analysis will be lower.

Ideally the mean quality score should be above 30. Remember that this is a Phred-based quality score (i.e. 10 = 1 in 10 chance of the base being incorrect, 20 = 1 in 100 chance of the base being incorrect, 30 = 1 in 1000 chance of the base being incorrect etc).

Flowcell IDs and other information useful if submitting your raw sequence data to public databases are also given in the Summary.html file.

Note that samples may sometimes be spiked with PhiX sequence to act as a quality control. Although PhiX reads are filtered out before the sequence files are provided to you, it is worth bearing in mind in case a few reads slip through.

Meta-analyses

1. pfam_comparison/

This directory contains the file pfam_comparison.txt. This is a text file suitable for viewing in a spreadsheet. In the example below, PFAM domain PF00020 (TNFR_c6 - Tumor necrosis factor receptor) is present in Sample N and Sample CM but absent from the other four samples, whereas PF00002 (7tm_2 Secretin receptor) is present in all five samples.
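The Phred scale used for the mean quality score in Summary.html can be checked with a few lines of Python. This is only a minimal sketch of the standard relationship Q = -10 * log10(P); the function names are illustrative and are not part of the results pipeline.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call is wrong, given its Phred score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # The three scores quoted in the guide: 10, 20 and 30.
    for q in (10, 20, 30):
        p = phred_to_error_probability(q)
        print(f"Q{q}: about 1 error in {round(1 / p)} bases")
```

A mean score of 30 therefore corresponds to roughly one incorrect base call per thousand bases.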
Sample directories

For each sample in your project, a series of analyses are performed. The first is a denovo assembly using Velvet and Oases (http://ebi.ac.uk/~zerbino/velvet, http://ebi.ac.uk/~zerbino/oases). Velvet performs an initial denovo assembly which Oases then refines to form transcripts. Once completed, the results are then passed to the cuffdiff package to perform differential expression analysis (http://cufflinks.cbcb.umd.edu/manual.html).

denovo_assembly/

This directory contains the results of the denovo assembly using Velvet and Oases.

File              Description
Log               Text file. Contains log information and parameters used for Velvet assembly.
Unusedreads.fa    FASTA file containing reads not used in the assembly.
contigs.fa        FASTA file containing the assembled contigs from Velvet.
transcripts.fa    FASTA file. This is essentially the final results of the denovo assembly. Transcripts are listed in the format:
                      Locus_1_Transcript_1/2_Confidence_1.000_Length_418
                  where Locus indicates a suspected genomic locus, Transcript_1/2 indicates that this is isoform 1 of 2, and Confidence indicates the fraction of reads within this locus which support this isoform.
stats.txt         Text file suitable for viewing in a spreadsheet. Velvet statistics (note that coverage and length are based on kmer coverage and length).

1. Log

This is the Log file produced by Velvet. This details which version and parameters were used along with summary statistics such as N50, number of contigs, maximum contig size, total assembly size and median depth of coverage. E.g. (note version number etc will vary, please check your Log file for details):

    Wed Nov 9 19:11:33 2011
    velvetg remapping_to_reference/unmappedreads_assembly/ -cov_cutoff auto -exp_cov auto -unused_reads yes
    Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk)
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    Compilation settings:
    CATEGORIES = 2
    MAXKMERLENGTH = 101
    Median coverage depth = 81.722184
    Final graph has 187 nodes and n50 of 17387, max 103277, total 644417, using 2558521/3274800 reads

denovo_assembly/assembly_summary_statistics

File               Description
stats.txt          Summary statistics for the assembly
sorted_contigs.fa  FASTA file containing isotigs ordered by size
*.png              Graphical representations of assembly metrics
*.dat              Data used to generate histograms

Of greatest interest to most users will be the stats.txt file which contains details of various assembly metrics. E.g.:
    Statistics for isotig lengths:
        Min isotig length: 161
        Max isotig length: 75,867
        Mean isotig length: 6988.76
        Standard deviation of isotig length: 7215.64
        Median isotig length: 4,830
        N50 isotig length: 11,691

    Statistics for numbers of isotigs:
        Number of isotigs: 5,296
        Number of isotigs >=1kb: 4,539
        Number of isotigs in N50: 966

    Statistics for bases in the isotigs:
        Number of bases in all isotigs: 37,012,455
        Number of bases in isotigs >=1kb: 36,625,379
        GC Content of isotigs: 50.27 %

Here we can see that this assembly contains 5296 contiguous sequences, 4539 of which are at least 1kb long. The total size of the assembly (i.e. Number of bases in all isotigs) should tally with the transcriptome size you are expecting. However, beware of possible ploidy effects if your initial estimate seems to be out by a factor of 2 or more.

expression_estimate/

File                                   Description
transcripts_expression_estimate.txt    Text file suitable for viewing in a spreadsheet. Contains coverage information for each transcript.
mapping_reads_to_transcripts.bam       Binary Alignment/Map (BAM) file. Contains the details of how the reads map to the contigs. View in IGV with the transcripts.fa file as reference to visualise mapping of reads to transcripts.
mappingstats.txt                       Text file. Contains summary statistics of mapping of reads to contigs.
Files which are likely to be of particular interest:

1. transcripts_expression_estimate.txt

Contains details of the number of reads mapping to the contigs. E.g.

    Contig ID                              Number of reads
    NODE_1_length_6194_cov_4305.022949     470559
    NODE_2_length_6789_cov_5382.181641     520777
    NODE_3_length_1653_cov_590.568665      55141
    NODE_4_length_2712_cov_48.6733066      8327

2. mappingstats.txt

    156461298 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    135416978 + 0 mapped (86.55%:nan%)
    156461298 + 0 paired in sequencing
    78230649 + 0 read1
    78230649 + 0 read2
    135416978 + 0 properly paired (86.55%:nan%)
    135416978 + 0 with itself and mate mapped
    0 + 0 singletons (0.00%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapq>=5)

This indicates that 86.55% of reads mapped to the transcripts and all did so with the expected paired-end insert size (only applicable for paired-end reads).

See the IGV user guide at http://www.broadinstitute.org/software/igv/userguide for details of how to load these into IGV to visualise your data.
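The 86.55% figure quoted above can be recomputed directly from the counts in mappingstats.txt. The following is a minimal sketch, assuming the file has the layout shown in the example (a QC-passed count, a plus sign, a QC-failed count, then a label); the function name and the embedded EXAMPLE text are illustrative only.

```python
EXAMPLE = """\
156461298 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
135416978 + 0 mapped (86.55%:nan%)
156461298 + 0 paired in sequencing
135416978 + 0 properly paired (86.55%:nan%)
"""


def mapped_fraction(flagstat_text: str) -> float:
    """Return mapped / total QC-passed reads from flagstat-style text."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        fields = line.split()
        # Expected shape: <passed> + <failed> <label ...>
        if len(fields) < 4 or fields[1] != "+":
            continue
        passed = int(fields[0])
        label = " ".join(fields[3:])
        if label.startswith("in total"):
            total = passed
        elif label.startswith("mapped"):
            mapped = passed
    if not total or mapped is None:
        raise ValueError("could not find 'in total' and 'mapped' lines")
    return mapped / total


if __name__ == "__main__":
    print(f"{mapped_fraction(EXAMPLE) * 100:.2f}% of reads mapped")
```

To use it on your own data, read the contents of mappingstats.txt into a string and pass that string to mapped_fraction.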
Annotation/

File                                           Description
transcripts.fa.blastp                          Text file. BlastP results of the transcripts.fa.orf file against the NCBI non-redundant protein database.
transcripts.fa.blastp.table                    Text file suitable for viewing in a spreadsheet. As above in tabular format.
transcripts.fa.orf                             FASTA file. Contains open reading frames translated into protein space using the appropriate codon usage table. Note that nucleotide start/stop co-ordinates are given in the header. ORFs are reported between any two STOP codons with a distance larger than 302 nucleotides.
transcripts.orf.pfam                           Text file suitable for viewing in a spreadsheet. PFAM results for all the ORFs in the file above.
taxonomy.txt                                   Text file. Taxonomy file generated from the GI (GenInfo Identifier) numbers in transcripts.fa.blastp.table.
taxonomy_of_unmapped_denovo_transcripts.pdf    PDF file. Contains a visual representation of the numbers of contigs mapping to a given taxon.
Several files here are likely to be of particular interest to the user:

1. transcripts.fa.orf

This contains the open reading frames called by the EMBOSS program getorf. Note that the header contains the name of the transcript, with the start and end locations of the ORF in the nucleotide file (transcripts.fa) appended. For example, below the first entry has an ORF on the reverse strand between positions 329 and 3. The second has an ORF between positions 213 and 1043 on the forward strand.

    >NODE_2_length_311_cov_1462.295776_1[329-3](REVERSESENSE)
    YGHEWRRMSRQCTHYGRWPQHGFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAK
    SRQRWLFYAYDRIRRTVVAHVFGERTLATLERLLSLLSAFEVVVWMTDG
    >NODE_3_length_1925_cov_70.433769_1[213-1043]
    WWPAMNARVAKLALDARAIRQSIIRTASAAPVDGVHLGPALSMVEIAAALYGAVMRFDPK
    NMASMARDRFLLSKGHAALALYATLHHYGVLSDDELATFDHSGSRFPALTPMNPPLGIDF
    AGGSLGMGVGYACGAALAQRLRGESWRHYIVLGDGECNEGSVWESAFFAAQQGLDQLTAI
    VDCNGFQSDWSCEQTIKMDFPALWAACGWHVETCDGHDIAALLAALDAPSHGKPKAIVAR
    TVKGKGVSFMEHNNAFHRARLSAAQRDAALAELEAHP

2. transcripts.orf.pfam

Contains the PFAM domains called for each ORF in the file above. Search the PFAM database using the ID (e.g. ResIII) for more details.

    # <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>
    NODE_1_length_103317_cov_46.027489_77 131 287 116 288 PF04851.8 ResIII Family 26 183 184 56.9 1.7e-15 1 CL0023
    NODE_1_length_103317_cov_46.027489_77 406 482 405 482 PF00271.24 Helicase_C Family 2 78 78 44.0 1.1e-11 1 CL0023
    NODE_1_length_103317_cov_46.027489_79 22 188 22 188 PF03614.6 Flag1_repress Family 1 165 165 267.7 2e-80 1 No_clan

3. taxonomy_of_unmapped_denovo_transcripts.pdf

This is the file most likely to be of interest here. It contains a graphical representation of the species to which the transcripts map. In the example below, red regions correspond to those with the most transcripts mapping.
The numbers next to each entry indicate the numbers of transcripts mapping to each location. Some transcripts may map to more than one location.
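When working with transcripts.fa.orf programmatically, the ORF co-ordinates described above can be pulled out of each header. The sketch below assumes headers look like the two examples in this guide (transcript name, an underscore and ORF number, then the co-ordinates in square brackets, with a REVERSESENSE marker for reverse-strand ORFs). The OrfHeader type and parse_orf_header function are hypothetical names, and the pattern tolerates optional spaces in case your file contains them.

```python
import re
from typing import NamedTuple


class OrfHeader(NamedTuple):
    transcript: str   # name of the transcript in transcripts.fa
    orf_number: int   # running ORF number appended to the name
    start: int        # first nucleotide position given in the header
    end: int          # second nucleotide position given in the header
    strand: str       # "-" if marked as reverse sense, otherwise "+"


_HEADER = re.compile(
    r"^>?(?P<transcript>[^\s\[]+)_(?P<orf>\d+)\s*"
    r"\[(?P<start>\d+)\s*-\s*(?P<end>\d+)\]\s*"
    r"(?P<rev>\(REVERSE\s*SENSE\))?"
)


def parse_orf_header(line: str) -> OrfHeader:
    """Split an ORF FASTA header into transcript name and co-ordinates."""
    match = _HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised ORF header: {line!r}")
    return OrfHeader(
        transcript=match.group("transcript"),
        orf_number=int(match.group("orf")),
        start=int(match.group("start")),
        end=int(match.group("end")),
        strand="-" if match.group("rev") else "+",
    )


def nucleotide_span(header: OrfHeader) -> int:
    """Number of nucleotides covered by the ORF, inclusive of both ends."""
    return abs(header.end - header.start) + 1


if __name__ == "__main__":
    example = ">NODE_3_length_1925_cov_70.433769_1[213-1043]"
    parsed = parse_orf_header(example)
    print(parsed, nucleotide_span(parsed))
```

Dividing the nucleotide span by three gives the expected protein length, which is a quick sanity check against the sequence that follows each header.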
raw_illumina_reads/

This directory contains the raw sequence data for this sample. The nucleotide_distribution and read_quality files contain the nucleotide distribution and quality scores (see QC section of this site). Files ending R1_001.fastq contain the raw sequence data for read 1 and those ending R2_001.fastq contain the raw sequence data for read 2. Those ending .filtered contain the data after filtering with the ea-utils package.
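For paired-end data it can be useful to confirm that the read 1 and read 2 files contain the same number of reads. The sketch below assumes uncompressed FASTQ files in the usual four-lines-per-read layout; the function name is illustrative, and the file paths are supplied by you on the command line.

```python
import sys


def count_fastq_reads(path: str) -> int:
    """Count reads in an uncompressed four-line-per-record FASTQ file."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


if __name__ == "__main__":
    # Usage: python count_reads.py <read 1 FASTQ> <read 2 FASTQ>
    if len(sys.argv) == 3:
        n_r1 = count_fastq_reads(sys.argv[1])
        n_r2 = count_fastq_reads(sys.argv[2])
        print(f"read 1: {n_r1}  read 2: {n_r2}")
        if n_r1 != n_r2:
            print("Warning: read 1 and read 2 counts differ")
```

Counts taken from the raw files will generally be higher than those from the .filtered files, since filtering removes reads.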