Maize genome sequence in FASTA format. Gene annotation file in gff format

Size: px

Start display at page:

Download "Maize genome sequence in FASTA format. Gene annotation file in gff format"

Jeremy Melton
5 years ago
Views:

1 Exercise 1. Using Tophat/Cufflinks to analyze RNAseq data. Step 1. One of CBSU BioHPC Lab workstations has been allocated for your workshop exercise. The allocations are listed on the workshop exercise web page: Please consult the PDF file with instructions on how to access and use the Lab workstations for the exercises posted on this page. If you would like to carry computations outside the workshop you will need to reserve a Linux workstation of the BioHPC Lab as described in BioHPC user guide Step 2. Copy the exercise files to the directory /workdir. As /workdir is shared by many users, you will need to create a sub directory under /workdir. You will need the following files: s_1_sequence.txt and s_5_sequence.txt : maize.fa: ZmB73_5a.59_WGS.gff : Illumina data files in FASTQ format Maize genome sequence in FASTA format Gene annotation file in gff format cd /workdir mkdir MyUserName cd MyUserName cp /shared_data/rnaseq/*. Note: a) Replace MyUserName with your own login user name; b) File names are sometimes very long. To make typing easier, you can type the first few letters then use the Tab key to auto finish the file name; c) * can be used a wildcard for file names in the commands. Step 3. Build a reference genome index using bowtie build, and then run tophat. You only need to run bowtie build once for the same genome file. In this exercise, we will run tophat twice, first time guided by the gff gene annotation file, second time without the gene annotation.

2 bowtie-build maize.fa maize >& bowtie-build.log & /programs/bin/tophat p 7 -o s1_guided -G ZmB73_5a_WGS.gff --no-novel-juncs maize s_1_sequence.txt >& tophat_1.log & /programs/bin/tophat p 3 -o s1_unguided maize s_1_sequence.txt >& tophat_2.log & mv s1_unguided/accepted_hits.bam s1_unguided_tophat.bam mv s1_guided/accepted_hits.bam s1_guided_tophat.bam IMPORTANT NOTE: The above example uses old data with 36bp reads, which requires use of the previous version of Tophat (1.2.0). Currently Ilumina sequencing produces longer reads, for which the new version of Tophat should be used (version 1.3.0). The newest version of Tophat can be invoked just as tophat in the command line, without version string. Samtools will produce the same file name for each run, which you might want to change to avoid confusion. In this example, I used the mv command to change the file name to s1_guided_tophat.bam, and at the same time move the file one directory up. Step 4. Using Samtools to index the bam file, then visualize the alignment using IGV. samtools index s1_unguided_tophat.bam samtools index s1_guided_tophat.bam This step is optional. It is for visualizing the read alignment to the genome. After running samtools index, you will see a new file called s1_tophat.bam.bai. You have three options to visualize the results with IGV: 1) run IGV on Lab workstation with VNC; 2) run IGV on Lab workstation with ssh and xming; 3) download output files to your own computer and run IGV there. 1. To run IGV over VNC on the Lab workstation you need to connect to the workstation using VNC (please consult BioHPC Lab user guide for more info on VNC connections). IGV is accessible as an icon on the left on Linux desktop. Double click the icon to start IGV. 2. To run IGV on Lab workstation with ssh and xming you need to connect to the workstation using ssh with X11 forwarding enabled. Start Xming on your local computer (or allow external X11 display if your local computer is Linux). Start IGV from the command line in the ssh window. Please consult BioHPC Lab user guide for more info on ssh/xming/x If you decide to run IGV on your local computer you will need to copy both the s1_tophat.bam and s1_tophat.bam.bai files to your local computer as well as maize.fa and ZmB73_5a_WGS.gff;

3 IGV software needs all of them. Start IGV software as per the online instructions a) Start the IGV software (depends on the way you run IGV, see above). b) Import the genome and gene annotation. (Using the fasta file as the sequence file, the gff file as the gene file). c) Load data Load the bam and bai files. If you have multiple samples, you can load each one as an individual track. d) Navigate around IGV Step 5. Run cuffdiff on the BAM files. cufflinks -G ZmB73_5a_WGS.gff s1_guided_tophat.bam cuffdiff ZmB73_5a_WGS.gff s1_guided_tophat.bam s5_guided_tophat.bam Note: Use cufflinks if you have only one sample. Use cuffdiff if you have multiple samples. Step 6. Transfer the output file transcripts.expr to your local computer. You can open this file using Excel. This file gave you the FPKM values for each gene to represent the expression level. After you feel more comfortable with running this software, it is highly recommended that you read the manual pages for Tophat and Cufflinks: Please remember that /workdir is shared by many users, and each workstation has different /workdir. After you finish a session, you will need to copy the files that you want to keep to your home directory /home/myusername. Once the files are copied to your home directory, you will be able to access them from any workstations.

4 Exercise 2. Using CBSU pipeline to analyze RNAseq data. In this exercise we will use CBSU RNA Seq pipeline to process the data from Exercise 1. Step 1. One of CBSU BioHPC Lab workstations has been allocated for your workshop exercise. The allocations are listed on the workshop exercise web page: Please consult the PDF file with instructions on how to access and use the Lab workstations for the exercises posted on this page for more information. If you would like to carry computations outside the workshop you will need to reserve a Linux workstation of the BioHPC Lab as described in BioHPC user guide Step 2. Copy the exercise files to the directory /workdir. As /workdir is shared by many users, you will need to create a sub directory under /workdir. You will need the following files: s_1_sequence.txt and s_5_sequence.txt : maize.fa: ZmB73_5a.59_WGS.gff : Rnaseq.conf: Illumina data files in FASTQ format Maize genome sequence in FASTA format Gene annotation file in gff format CBSU RNASeq pipeline options file cd /workdir mkdir MyUserName cd MyUserName cp /shared_data/rnaseq/*. cp /programs/rnaseqpipeline/rnaseq.conf. NOTE: If you have already done Exercise 1 you need to execute only the last command in the box above to copy the pipeline s options file. Step 3. Build a reference genome index using bwa. bwa index maize.fa >& bwa-index.log & Step 4. Modify options file so the options used are appropriate for your particular data. The template options file has been created for long reads data, our workshop data contains short reads so we need to modify some parameters, also, our files are in different directory than the template files and may have

5 different names. Open rnaseq.conf file in your favorite text editor and change the following parameters: fastq sequence file, annotation file, genome file, number of mismatches allowed for genome alignment and number of mismatches allowed for junction alignment. Number of BWA threads should also be modified to reflect the number of cores on your current machine (8 on remote Lab workstations). # annotation file /workdir/myloginname/zmb73_5a_wgs.gff # genome file (index files must be present in the same directory!) /workdir/myloginname/maize.fa # fastq sequence file /workdir/myloginname/ s_1_sequence.txt # output files core name, output files will be named core_namexxx, where XXX are various suffixes s_1_sequence_cbsu # number of mismatches allowed (edit distance), for genome alignment 2 # number of mismatches allowed (edit distance), for junction alignment 0 Step 5. Run the pipeline. /programs/rnaseqpipeline/rnaseq.pl >& rnaseq_1.log & Step 6. Change the options file to process s_5_sequence.txt data file instead of s_1_sequence.txt as in the step 4 and 5 above. Run the pipeline for s_5_sequence.txt.

6 Step 7. Analyze the results. NOTE: Pre computed output files are available in /shared_data/rnaseq/output.

replace my_user_id in the commands with your actual user ID

replace my_user_id in the commands with your actual user ID Exercise 1. Alignment with TOPHAT Part 1. Prepare the working directory. 1. Find out the name of the computer that has been reserved for you (https://cbsu.tc.cornell.edu/ww/machines.aspx?i=57 ). Everyone