
O. Graña *a,b, M. Rubio-Camarillo a, F. Fdez-Riverola b, D.G. Pisano a and D. Glez-Peña b

a Bioinformatics Unit, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO). 3rd Melchor Fernández Almagro St, Madrid, Spain.
b ESEI - Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, Universidad de Vigo, Ourense, Spain.

* ograna@cnio.es

nextpresso v1.4

Contents

1. Introduction
2. Prerequisites
3. Input files
4. Configuration files
5. Execution
6. Output files

1. Introduction

The pipeline performs a complete analysis of RNA-seq data across four execution levels: (1) read quality and contamination checks, (2) read preprocessing through trimming and/or down-sampling, (3) alignment of reads to the genomic or transcriptomic references and (4) processing of the resulting alignments to perform the different analyses (Figure 1).

Figure 1. Workflow that shows the four execution levels of nextpresso.

nextpresso has been designed for execution on HPC systems scheduled by either SGE or PBS, although sequential execution on a single workstation is also supported.

****If you have problems with the installation or execution, or you detect a bug, please send an email to ograna@cnio.es so that we can help with the problem or try to fix the bug.

2. Prerequisites

2.1. Operating System: UNIX-based operating systems, e.g. Linux or Mac OS X.

2.2. Required 3rd party software: Before executing nextpresso, the programs and libraries listed below and their corresponding dependencies must be correctly installed.

1. FastQC
2. FastScreen
3. BEDTools
4. Samtools
5. Bowtie
6. Tophat
7. Seqtk
8. PeakAnalyzer
9. HTSeq-count
10. Cufflinks
11. BedGraphToBigWig
12. GSEA
13. Perl, with the following additional modules from CPAN:
    XML/Simple.pm
    XML/Validator/Schema.pm
    /Schema.pm
    XML/LibXML.pm
    Excel/Writer/XLSX.pm
    XLSX/lib/Excel/Writer/XLSX.pm
    GDGraph/Graph.pm
14. R environment or higher
    Additional R libraries and packages:
    scatterplot3d
    S4Vectors
    DESeq2
    BiocParallel
    affyio
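As a quick sanity check before configuring the pipeline, the presence of some of the tools listed above can be probed from Python. This is only a sketch under stated assumptions: the executable names in the list are guesses at how the programs are called on a given system, and nextpresso itself locates programs through configuration.xml rather than through PATH.

```python
import shutil

# Assumed executable names for some of the listed tools (not taken from the
# manual); adjust them to match the local installation.
ASSUMED_EXECUTABLES = [
    "fastqc", "bedtools", "samtools", "bowtie",
    "tophat", "seqtk", "cufflinks", "perl", "Rscript",
]


def missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All assumed executables were found on PATH.")
```

A tool reported as missing here may still be usable by the pipeline if its location is given explicitly in configuration.xml.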

3. Input files

Input files can be FASTQ files or raw BAM files (with unaligned reads). Raw BAM files are converted to FASTQ during execution.

4. Configuration files

There are two configuration files: configuration.xml, which sets the program locations and the queue scheduler management, and experiment.xml, which holds all the experiment details, i.e. the definition of the samples in the experiment, the comparisons to perform and the parameter values for the programs used in the different steps. Note that the same configuration.xml file is valid for all analysed experiments; it only needs to be updated when the hardware or the programs used change.

configuration.xml

Stores all the program locations and, for execution on a computer cluster, the queue scheduler parameters. An example is shown below:

<?xml version="1.0" encoding="utf-8"?>
<configurationparameters maximunnumberofinstancesallowedtorunsimultaneouslyinoneparticularstep="4">
  <extrapathsrequired></extrapathsrequired>
  <fastqcpath>/home/ograna/software/fastqc_v0.10.1</fastqcpath>
  <fastqscreen>
    <path>/home/ograna/software/fastq_screen_v0.4.2</path>
    <configurationfile>/home/ograna/software/fastq_screen_v0.4.2/fastq_screen.conf</configurationfile>
    <subset>10000</subset>
  </fastqscreen>
  <bedtoolspath>/home/ograna/software/bedtools-version /bin</bedtoolspath>
  <samtoolspath>/home/ograna/software/samtools </samtoolspath>
  <bowtiepath>/home/ograna/software/bowtie-1.0.0</bowtiepath>
  <tophatpath>/home/ograna/software/tophat/tophat linux_x86_64</tophatpath>
  <seqtkpath>/home/ograna/software/seqtk/seqtk-master/</seqtkpath>
  <peakannotatorpath>/home/ograna/software/peakanalyzer/modified_peakannotator</peakannotatorpath>
  <htseqcount>
    <path>/home/ograna/software/htseq-0.5.3p9/build/scripts-2.7</path>
  </htseqcount>
  <tophatfusion>
    <path>/home/ograna/software/tophat/tophat linux_x86_64</path>
  </tophatfusion>
  <cufflinks>
    <path>/home/ograna/software/cufflinks linux_x86_64</path>
  </cufflinks>
  <bedgraphtobigwig>
    <path>/home/ograna/software/bedgraphtobigwig</path>
  </bedgraphtobigwig>
  <gsea>
    <path>/home/ograna/software/gsea/gsea jar</path>
    <chip>gseaftp.broadinstitute.org://pub/gsea/annotations/gene_symbol.chip</chip>
    <maxmemory>8g</maxmemory>
  </gsea>
  <queuesystem>none</queuesystem>
  <queuename>none</queuename>
  <multicore>2</multicore>
</configurationparameters>

All the definitions shown above are mandatory. If they are not set properly, nextpresso cannot complete the execution, because they are checked first.

maximunnumberofinstancesallowedtorunsimultaneouslyinoneparticularstep: (yes, what a name...) represents the number of instances of a program that can be launched at once. For example, the number of Tophat instances that can run simultaneously, each one aligning the reads from one sample to the reference.

extrapathsrequired: empty by default. Use it only when additional paths must be specified (this depends very much on the computer where the pipeline is executed), for example:

<extrapathsrequired>LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:/Volumes/RAID/Soft/Linux_x86_64/System/boost/1.51/lib/</extrapathsrequired>

queuesystem: the type of queue scheduler. The accepted values are SGE, PBS and none (the latter for execution on systems without a queue, such as single workstations).

****If queuesystem is set to none, multisample execution is not performed in parallel but sequentially, because there is no scheduler to synchronize the programs.

queuename: the name of the queue in which the tasks are going to run. Default value: normal ***(in my case).

multicore: only valid for SGE systems. Represents the number of slots to reserve for the execution. To use this feature, the SGE manager must create a parallel environment called multicore.

experiment.xml

This file holds the definitions that are particular to each experiment: the names, locations and types of the different samples, the comparisons to perform and the parameter values to use with the different programs.

Example for a paired-end experiment, with only two samples, WT and KO.
<?xml version="1.0" encoding="utf-8"?>
<experiment name="myprojectname" workspace="/mnt/ograna/rnaseq/analysis" referencesequence="/references/mus_musculus/bowtieindex/genome.fa" GTF="/REFERENCES/Mus_musculus/genes.gtf" pairedend="true">
  <library name="wt" leftfile="wt_1.fastq">
    <rightfile>wt_2.fastq</rightfile>
    <type>fastq</type>
    <solexaqualityencoding></solexaqualityencoding>
    <librarytype>firststrand</librarytype>
    <trimming do="false">
      <nnucleotidesleftend>3</nnucleotidesleftend>
      <nnucleotidesrightend>5</nnucleotidesrightend>
    </trimming>
    <downsampling do="false">
      <seed>3</seed>
      <nreads> </nreads>
    </downsampling>
    <mateinnerdist>197</mateinnerdist>
    <matestddev>50</matestddev>
  </library>
  <library name="ko" leftfile="ko_1.fastq">
    <rightfile>ko_2.fastq</rightfile>
    <type>fastq</type>
    <solexaqualityencoding></solexaqualityencoding>
    <librarytype>firststrand</librarytype>
    <trimming do="false">
      <nnucleotidesleftend>3</nnucleotidesleftend>
      <nnucleotidesrightend>5</nnucleotidesrightend>
    </trimming>
    <downsampling do="false">
      <seed>3</seed>
      <nreads>0</nreads>
    </downsampling>
    <mateinnerdist>194</mateinnerdist>
    <matestddev>50</matestddev>
  </library>
  <comparison name="kovswt">
    <condition name="wt" cuffdiffposition="1">
      <libraryname>wt</libraryname>
    </condition>
    <condition name="ko" cuffdiffposition="2">
      <libraryname>ko</libraryname>
    </condition>
  </comparison>
  <tophat usegtf="true" ntophatthreads="4" maxmultihits="20" readmismatches="2" segmentlength="20" segmentmismatches="1" splicemismatches="0" reportsecondaryalignments="false" bowtie="1" readeditdist="4" readgaplength="2" referenceindexing="false">
    <coveragesearch>--no-coverage-search</coveragesearch>
    <fusionsearchexperiment performfusionsearch="true">
    </fusionsearchexperiment>
  </tophat>
  <cufflinks usegtf="true" nthreads="14" fragbiascorrect="true" multireadcorrect="true" librarynormalizationmethod="classic-fpkm" maxbundlefrags=" ">
  </cufflinks>
  <cuffmerge nthreads="4">
  </cuffmerge>
  <cuffquant usecuffmergeassembly="false" nthreads="4" fragbiascorrect="true" multireadcorrect="true" seed="123l" maxbundlefrags=" ">
  </cuffquant>
  <cuffnorm usecuffmergeassembly="false" nthreads="4" outputformat="simple-table" librarynormalizationmethod="classic-fpkm" seed="123l" normalization="compatiblehits">
  </cuffnorm>
  <cuffdiff usecuffmergeassembly="false" nthreads="4" fragbiascorrect="true" multireadcorrect="true" librarynormalizationmethod="classic-fpkm" FDR="0.05" minalignmentcount="5" seed="123l" FPKMthreshold="0.05" maxbundlefrags=" ">
  </cuffdiff>
  <htseqcount minaqual="0" featuretype="exon" idattr="gene_id">
    <mode>intersection-nonempty</mode>
  </htseqcount>
  <deseq2 nthreads="2" alpha="0.05" padjustmethod="fdr"></deseq2>
  <bedgraphtobigwig chromosomesizesfile="/mnt/supertocho/ograna/references/mm9q.chromosome.sizes" bigdataurlprefix=" ">
  </bedgraphtobigwig>
  <gsea collapse="false" mode="max_probe" norm="meandiv" nperm="1000" scoring_scheme="classic" include_only_symbols="true" make_sets="true" plot_top_x="250" rnd_seed="123" set_max="1000" set_min="10" zip_report="true">
    <geneset>/gsea_pathways_definitions/c3.mir.v4.0.symbols_microrna_targets.gmt</geneset>
    <geneset>/gsea_pathways_definitions/c3.tft.v4.0.symbols_transcriptionfactors.gmt</geneset>
    <geneset>/gsea_pathways_definitions/c4.cm.v4.0.symbols_cancer_modules.gmt</geneset>
    <geneset>/gsea_pathways_definitions/c2.cp.kegg.v4.0.symbols.gmt</geneset>
  </gsea>
  <tophatfusion ntophatfusionthreads="2" numfusionreads="3" numfusionpairs="2" numfusionboth="0" fusionreadmismatches="2" fusionmultireads="2" nonhuman="false" pathtoannotationfiles="/mnt/supertocho/ograna/references/tophatfusion/" pathtoblastall="/home/ograna/software/blast/blast /bin" pathtoblastn="/home/ograna/software/blast/ncbi-blast /bin">
  </tophatfusion>
  <spikeincontrolmixes do="false" ref="/home/ograna/spikes/ficheros_spikes/ercc92.fa" gtf="/home/ograna/spikes/ficheros_spikes/ercc92.gtf" nthreadsforbowtie="8">
  </spikeincontrolmixes>
</experiment>

****For a single-end experiment, the only difference is to set pairedend="false" and to leave all the rightfile fields empty, e.g. <rightfile></rightfile>. The values of <mateinnerdist>197</mateinnerdist> and <matestddev>50</matestddev> are then not taken into account.

5. Execution

Executing the pipeline is easy once both XML files have been configured.
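Before launching a run, it can be worth confirming that the two XML files are consistent with the rules described in section 4. The following is a minimal sketch, not part of nextpresso: it assumes the element and attribute names exactly as they are printed in this manual (queuesystem, library, libraryname), and it works on small inline samples rather than on the real files.

```python
import xml.etree.ElementTree as ET

# Values the manual lists as accepted for <queuesystem>.
ACCEPTED_QUEUE_SYSTEMS = {"SGE", "PBS", "none"}


def check_queue_system(config_xml):
    """Return the queuesystem value, raising ValueError if it is not accepted."""
    root = ET.fromstring(config_xml)
    value = (root.findtext("queuesystem") or "").strip()
    if value not in ACCEPTED_QUEUE_SYSTEMS:
        raise ValueError(f"unexpected queuesystem: {value!r}")
    return value


def undefined_libraries(experiment_xml):
    """Return library names used in comparisons but never defined as <library>."""
    root = ET.fromstring(experiment_xml)
    defined = {lib.get("name") for lib in root.iter("library")}
    used = {(node.text or "").strip() for node in root.iter("libraryname")}
    return sorted(used - defined)


# Reduced samples for illustration only; real files carry many more elements.
SAMPLE_CONFIG = """<configurationparameters>
  <queuesystem>none</queuesystem>
</configurationparameters>"""

SAMPLE_EXPERIMENT = """<experiment name="myprojectname" pairedend="true">
  <library name="wt" leftfile="wt_1.fastq"/>
  <library name="ko" leftfile="ko_1.fastq"/>
  <comparison name="kovswt">
    <condition name="wt" cuffdiffposition="1"><libraryname>wt</libraryname></condition>
    <condition name="ko" cuffdiffposition="2"><libraryname>ko</libraryname></condition>
  </comparison>
</experiment>"""

if __name__ == "__main__":
    print("queuesystem:", check_queue_system(SAMPLE_CONFIG))
    print("undefined libraries:", undefined_libraries(SAMPLE_EXPERIMENT))
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. nextpresso performs its own checks at start-up, so this only catches some mistakes early.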
An explanation of how to run the pipeline is shown by simply typing 'perl RNAseq.pl', which prints the following message:

perl RNAseq.pl --configdoc configdocfile --expdoc expdocfile --step step_number

Example:

a) complete execution of all steps in each workflow level

perl RNAseq.pl --step configdoc config/configurationparameters.xml --expdoc config/experimentparameters.xml

b) execution of some detailed steps

perl RNAseq.pl --step configdoc configurationparameters.xml --expdoc experimentparameters.xml

Steps Description:

Step 1: sequencing quality && contamination check (fastqc & fastqscreen)
Step 2: trimming && downsampling (seqtk)
Step 3: Aligning (tophat)
Step 4: transcripts assembly && quantification (cufflinks and cuffmerge)
Step 5: differential expression (cuffquant, cuffdiff and cuffnorm)
Step 6: htseq-count (gets read counts for genes) + DESeq2 differential expression
Step 7: BedGraph and BigWig files for genome browsers
Step 8: GSEA for specific gene sets over the different comparisons done with cuffdiff
Step 9: gene fusion prediction
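When the pipeline is driven from a script, the call can be assembled as an argument list. The sketch below uses only the three options printed in the usage message (--configdoc, --expdoc, --step); the helper names are hypothetical, and because the exact syntax for requesting several steps is not shown here, the step value is passed through unchanged.

```python
import shlex

# Step descriptions as listed in this manual.
STEP_DESCRIPTIONS = {
    1: "sequencing quality && contamination check (fastqc & fastqscreen)",
    2: "trimming && downsampling (seqtk)",
    3: "Aligning (tophat)",
    4: "transcripts assembly && quantification (cufflinks and cuffmerge)",
    5: "differential expression (cuffquant, cuffdiff and cuffnorm)",
    6: "htseq-count (gets read counts for genes) + DESeq2 differential expression",
    7: "BedGraph and BigWig files for genome browsers",
    8: "GSEA for specific gene sets over the different comparisons done with cuffdiff",
    9: "gene fusion prediction",
}


def build_command(configdoc, expdoc, step, script="RNAseq.pl"):
    """Build the argument list for one nextpresso invocation (hypothetical helper)."""
    return [
        "perl", script,
        "--configdoc", configdoc,
        "--expdoc", expdoc,
        "--step", str(step),
    ]


if __name__ == "__main__":
    cmd = build_command(
        "config/configurationparameters.xml",
        "config/experimentparameters.xml",
        3,
    )
    print(STEP_DESCRIPTIONS[3])
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.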

6. Output files

nextpresso produces different output directories and log files depending on the executed steps. A simulated situation is shown below (screen capture) with the created output files and directories:

a) fastqc directory, which contains the summary of the sequencing quality check for each of the samples.
b) fastqscreen directory, which contains the summary of the cross-contamination check for each of the samples.
c) trimmedsamples directory, with the FASTQ files trimmed to the specified nucleotide position (when this step is executed, the input files fed to the alignment step are the new ones created here).
d) downsampledsamples directory, with the downsampled FASTQ files (not shown here as it was not executed).
e) alignments directory, with the output files produced during the alignment step for each of the samples, together with an alignment summary containing alignment percentages.
f) bigwiffilesdir directory, with the BedGraph and BigWig files needed to visualize read alignments in a genome browser (like Ensembl or the UCSC Genome Browser).
g) cufflinks directory, with the calculated transcript abundance in each sample (FPKM values). It also contains a Pearson correlation test and PCAs that show similarity among replicates. Furthermore, when correction of transcript expression is performed with spike-ins, the corresponding files are stored here.
h) cuffmerge directory, which contains a file with a merge of the original transcript annotation plus the additional annotation generated by cufflinks.
i) cuffquant directory, which contains intermediate files derived from the alignment files, required by cuffnorm and cuffdiff.
j) cuffnorm directory, which contains inter-sample quantification of transcript abundance (FPKM values).
k) cuffdiff directory, with the differential expression files generated by cuffdiff for each of the comparisons.
l) htseqcount directory, with the output files generated by HTSeq-count that are later fed to DESeq2.
m) deseq directory, with the differential expression test performed with DESeq2.
n) GSEA directory, with the gene set enrichment analysis of gene signatures across the different comparisons.
o) fusion directory, with predicted gene fusions (not shown here).

These directories are accompanied by their corresponding log files, which show details of the execution of each step.
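To see at a glance which steps have already produced output, the directory names listed above can be checked from Python. This is a sketch under one loud assumption: that the directories are created directly inside the workspace given in experiment.xml, which this manual does not state explicitly. The directory names are copied as they are printed above.

```python
from pathlib import Path

# Directory names as printed in this manual, in the order a) to o).
EXPECTED_DIRECTORIES = [
    "fastqc", "fastqscreen", "trimmedsamples", "downsampledsamples",
    "alignments", "bigwiffilesdir", "cufflinks", "cuffmerge", "cuffquant",
    "cuffnorm", "cuffdiff", "htseqcount", "deseq", "GSEA", "fusion",
]


def present_output_directories(workspace):
    """Return the expected directories that exist directly under workspace.

    Assumption: output directories live at the top level of the workspace.
    """
    base = Path(workspace)
    return [name for name in EXPECTED_DIRECTORIES if (base / name).is_dir()]


if __name__ == "__main__":
    # Hypothetical workspace path, taken from the experiment.xml example.
    for name in present_output_directories("/mnt/ograna/rnaseq/analysis"):
        print(name)
```

Directories that are absent simply correspond to steps that were not executed, as with downsampledsamples and fusion in the example above.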
