Ensembl RNASeq Practical

The aim of this practical session is to use BWA to align 2 lanes of zebrafish paired-end Illumina RNASeq reads to chromosome 12 of the zebrafish ZV9 assembly. We have restricted the analysis to chromosome 12 in order to speed up the alignment process. Once the reads have been aligned you can process the alignments using Samtools and display them in the Ensembl browser alongside some of our own annotation.

Overview
Command-line actions are coloured green. Please note that most commands are wrapped across more than one line. URLs are coloured blue.

You should have a folder entitled EnsemblRNASeqPractical that contains the following files:

    Bam/
    Doc/
        EnsemblRNASeqPractical.txt
    Fastq/
        2cell_chr12_R1.fastq
        2cell_chr12_R2.fastq
        6hpf_chr12_R1.fastq
        6hpf_chr12_R2.fastq
    Genome/
        chr12.fasta

Step 1: Index the genome file

First we need to index the genome file so that BWA can use it. We do this using the BWA index command. The following is one command, wrapped over two lines.

BWA command to index the genome fasta file:

    /opt/bwa-0.6.1/bwa index -a bwtsw

Step 2: Align the reads to the genome

Once that has finished we can start the alignment. We are using 2 lanes of 76 bp paired-end reads, which gives us 4 files: 2 for each lane, containing the 1st and 2nd reads respectively. We align each file independently using BWA aln, creating .sai files.

Command to align 2cell data, 1st reads:

    /home/training/desktop/ensemblrnaseqpractical/bam/2cell_chr12_r1.sai \
    /home/training/desktop/ensemblrnaseqpractical/fastq/2cell_chr12_r1.fastq

Command to align 2cell data, 2nd reads:

    /home/training/desktop/ensemblrnaseqpractical/bam/2cell_chr12_r2.sai \
    /home/training/desktop/ensemblrnaseqpractical/fastq/2cell_chr12_r2.fastq
Command to align 6hpf data, 1st reads:

    /home/training/desktop/ensemblrnaseqpractical/bam/6hpf_chr12_r1.sai \
    /home/training/desktop/ensemblrnaseqpractical/fastq/6hpf_chr12_r1.fastq

Command to align 6hpf data, 2nd reads:

    /home/training/desktop/ensemblrnaseqpractical/bam/6hpf_chr12_r2.sai \
    /home/training/desktop/ensemblrnaseqpractical/fastq/6hpf_chr12_r2.fastq

The -n 37 parameter allows BWA to include up to 37 mismatches in the alignment, i.e. roughly half the read length. The -i 76 parameter disallows indels within 76 bp of the read ends; since the reads are 76 bp long, this means that we do not want any indels in the alignment over the full length of the reads.

Step 3: Create SAM files

Once the alignments have run we need to create the SAM file from the pairs. We process the .sai files along with the genome (chromosome) and fastq files using BWA sampe. This will produce a single SAM file for each sample.

Command to make SAM files for 2cell data:

    /opt/bwa-0.6.1/bwa sampe -A -a 200000 -f \
        /home/training/desktop/ensemblrnaseqpractical/bam/2cell_chr12.sam \
        /home/training/desktop/ensemblrnaseqpractical/bam/2cell_chr12_r1.sai \
        /home/training/desktop/ensemblrnaseqpractical/bam/2cell_chr12_r2.sai \
        /home/training/desktop/ensemblrnaseqpractical/fastq/2cell_chr12_r1.fastq \
        /home/training/desktop/ensemblrnaseqpractical/fastq/2cell_chr12_r2.fastq

Command to make SAM files for 6hpf data:

    /opt/bwa-0.6.1/bwa sampe -A -a 200000 -f \
        /home/training/desktop/ensemblrnaseqpractical/bam/6hpf_chr12.sam \
        /home/training/desktop/ensemblrnaseqpractical/bam/6hpf_chr12_r1.sai \
        /home/training/desktop/ensemblrnaseqpractical/bam/6hpf_chr12_r2.sai \
        /home/training/desktop/ensemblrnaseqpractical/fastq/6hpf_chr12_r1.fastq \
        /home/training/desktop/ensemblrnaseqpractical/fastq/6hpf_chr12_r2.fastq

The -A parameter tells BWA to skip the estimation of insert size. This is because BWA is expecting genomic reads rather than transcriptome reads. Transcriptome read pairs can span introns, which causes problems when estimating insert size.
The -a 200000 parameter tells BWA to use a maximum allowed insert size (the distance between the 2 reads of a pair) of 200 kb. This effectively acts as a maximum intron length for the alignment.

In case samtools does not work, run this command:

    export LD_LIBRARY_PATH=/opt/zlib-1.2.6
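Steps 1 to 3 can be sketched as a single shell script. This is a minimal sketch under stated assumptions, not the practical's own script: the run helper, the DRY_RUN switch and the BASE and GENOME variables are hypothetical, and the default paths follow the lower-case form used in the commands above (the folder listing uses mixed case, so adjust them to match your file system). The genome file is passed as the first positional argument to bwa aln and bwa sampe, which is where BWA expects the indexed reference. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of Steps 1-3: index chr12, align each fastq file, pair the alignments.
# All variable names and default paths are assumptions; override them as needed.
set -eu

BWA="${BWA:-/opt/bwa-0.6.1/bwa}"
BASE="${BASE:-/home/training/desktop/ensemblrnaseqpractical}"
GENOME="${GENOME:-$BASE/genome/chr12.fasta}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = print and execute

# Print a command line, then execute it unless DRY_RUN is 1.
run() {
    echo "$*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Step 1: build the BWA index for the chromosome.
index_genome() {
    run "$BWA" index -a bwtsw "$GENOME"
}

# Steps 2 and 3 for one sample name (for example 2cell or 6hpf).
align_sample() {
    sample="$1"
    # Step 2: one bwa aln run per fastq file, each writing a .sai file.
    for mate in r1 r2; do
        run "$BWA" aln -n 37 -i 76 \
            -f "$BASE/bam/${sample}_chr12_${mate}.sai" \
            "$GENOME" \
            "$BASE/fastq/${sample}_chr12_${mate}.fastq"
    done
    # Step 3: combine both .sai files and both fastq files into one SAM file.
    run "$BWA" sampe -A -a 200000 \
        -f "$BASE/bam/${sample}_chr12.sam" \
        "$GENOME" \
        "$BASE/bam/${sample}_chr12_r1.sai" \
        "$BASE/bam/${sample}_chr12_r2.sai" \
        "$BASE/fastq/${sample}_chr12_r1.fastq" \
        "$BASE/fastq/${sample}_chr12_r2.fastq"
}

index_genome
for s in 2cell 6hpf; do
    align_sample "$s"
done
```

Running the script with DRY_RUN=0 in the environment executes the commands instead of only printing them. Printing first makes it easy to compare each generated command line against the ones listed in the steps above before committing to a long alignment run.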
Step 4: Create BAM files, sort and index them

Once the pairs have been processed into SAM files we use samtools to convert them into BAM files.

Command to make BAM files for 2cell data:

    /opt/samtools/samtools view -S -b \
        /home/training/desktop/ensemblrnaseqpractical/bam/2cell_chr12.sam -o

Command to make BAM files for 6hpf data:

    /opt/samtools/samtools view -S -b \
        /home/training/desktop/ensemblrnaseqpractical/bam/6hpf_chr12.sam -o

Here the -S parameter specifies that the input is in SAM format, and -b specifies that the output should be in BAM format.

In order to use the files on the website they must be sorted and indexed. This can be done as follows.

Samtools command to sort the 2cell file:

    /opt/samtools/samtools sort \
        /home/training/desktop/ensemblrnaseqpractical/bam/2cell_chr12_sorted

Samtools command to sort the 6hpf file:

    /opt/samtools/samtools sort \
        /home/training/desktop/ensemblrnaseqpractical/bam/6hpf_chr12_sorted

Note that the .bam extension is appended to the file name when using samtools sort.

Command to index the 2cell BAM file:

    /opt/samtools/samtools index \
        /home/training/desktop/ensemblrnaseqpractical/bam/2cell_chr12_sorted.bam

Command to index the 6hpf BAM file:

    /opt/samtools/samtools index \
        /home/training/desktop/ensemblrnaseqpractical/bam/6hpf_chr12_sorted.bam

Samtools flagstat will give you some basic statistics about the alignments.

Flagstat command for 2cell data:

    /opt/samtools/samtools flagstat

Flagstat command for 6hpf data:

    /opt/samtools/samtools flagstat
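The whole of Step 4 can be sketched as one function applied to each sample. This is an illustrative sketch, not the practical's own script: the run helper, the DRY_RUN switch, the BAM_DIR variable and the name of the intermediate unsorted BAM file (<sample>_chr12.bam) are assumptions, as is running flagstat on the sorted BAM file. It uses the legacy samtools sort form, which takes an output prefix and appends .bam as noted above; newer samtools releases expect samtools sort -o <out.bam> <in.bam> instead. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of Step 4: SAM -> BAM -> sorted BAM -> index -> flagstat.
# Variable names, default paths and the unsorted BAM name are assumptions.
set -eu

SAMTOOLS="${SAMTOOLS:-/opt/samtools/samtools}"
BAM_DIR="${BAM_DIR:-/home/training/desktop/ensemblrnaseqpractical/bam}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = print and execute

# Print a command line, then execute it unless DRY_RUN is 1.
run() {
    echo "$*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Run the four samtools steps for one sample name (for example 2cell or 6hpf).
process_sample() {
    sample="$1"
    sam="$BAM_DIR/${sample}_chr12.sam"
    bam="$BAM_DIR/${sample}_chr12.bam"          # assumed name for the unsorted BAM
    sorted="$BAM_DIR/${sample}_chr12_sorted"    # output prefix; sort appends .bam

    run "$SAMTOOLS" view -S -b -o "$bam" "$sam"  # SAM in (-S), BAM out (-b)
    run "$SAMTOOLS" sort "$bam" "$sorted"        # legacy form: <in.bam> <out.prefix>
    run "$SAMTOOLS" index "$sorted.bam"          # writes <sorted>.bam.bai
    run "$SAMTOOLS" flagstat "$sorted.bam"       # basic alignment statistics
}

for s in 2cell 6hpf; do
    process_sample "$s"
done
```

Set DRY_RUN=0 in the environment to execute the commands rather than only print them.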
Step 5: View results

BAM files are often quite large and are unsuitable for uploading to a website, so in order to view the alignments in the Ensembl browser you need to host the sorted and indexed BAM files on either a web server or an FTP site. Both the sorted BAM file and the index file ending in .bai are needed to view the alignments on the website. The website requires the BAM file URL to be entered; it then looks for a .bai file with the same name in the same directory.

For convenience we have already set up an FTP site that contains the files you just created. If you enter the following URL into your web browser:

    ftp://ftp.sanger.ac.uk/pub/users/sw4/danio/practical/bam

you will see 2 directories: Exons and Introns. Exons contains the alignments to chr12 that you just made. Introns contains spliced alignments that we created using the RNASeq pipeline for several tissues, including the 2cell and 6hpf lanes we have used here.

We have chosen ENSDARG00000055381 as a good example of a chromosome 12 gene with differential expression highlighted by the 6hpf and 2cell lanes, though the alignments cover the whole of chr12 if you want to look for other interesting examples.

1. To load the alignments, first open your web browser and go to www.ensembl.org
2. Enter ENSDARG00000055381 into the search box to take you to the gene view page. It is a gene called "bambia".
3. Click on the location tab at the top of the page to take you to the view of the region on the chromosome.
4. Now we can load our BAM files. Click on the "Configure this page" button on the left panel. This opens a configuration panel; we want the "custom data" tab at the top right.
5. To load the BAM files click on "Attach Remote File" on the left hand panel.
6. Here enter the URL of the files:
    ftp://ftp.sanger.ac.uk/pub/users/sw4/danio/bam/exons/2cells.bam
7. Select the data format as BAM and name the track. If you have an Ensembl account you can store the track in your account to use another time.
8. Then do the same for the 6hpf file and any of the Intron files you might like to view:
    ftp://ftp.sanger.ac.uk/pub/users/sw4/danio/bam/exons/6hpf.bam
    ftp://ftp.sanger.ac.uk/pub/users/sw4/danio/bam/introns/6hpf.bam
    ftp://ftp.sanger.ac.uk/pub/users/sw4/danio/bam/introns/2cells.bam
9. Once you have attached the remote files you should be able to see them in the region view browser. If they do not show up you may need to turn them on by going to "Configure this page" -> "Your data".
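If you host your own files rather than using the FTP site above, it can help to confirm that each sorted BAM file has its index beside it before copying both to the server, since the browser needs the pair. The following is a small sketch of such a check; the check_bam_pair function and the BAM_DIR variable are hypothetical names, and the default path follows the lower-case form used in the commands of this practical. It relies on samtools index writing the index as <name>.bam.bai next to <name>.bam.

```shell
#!/bin/sh
# Sketch: check that a sorted BAM file and its .bai index sit side by side.
# Function name, variable name and default path are assumptions.
set -eu

BAM_DIR="${BAM_DIR:-/home/training/desktop/ensemblrnaseqpractical/bam}"

# Print OK and return 0 if both files exist; print MISSING and return 1 otherwise.
check_bam_pair() {
    if [ -f "$1" ] && [ -f "$1.bai" ]; then
        echo "OK       $1"
    else
        echo "MISSING  $1 (need both $1 and $1.bai)"
        return 1
    fi
}

for s in 2cell 6hpf; do
    # '|| true' keeps the loop going so that every sample is reported.
    check_bam_pair "$BAM_DIR/${s}_chr12_sorted.bam" || true
done
```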