Read mapping with BWA and BOWTIE

Size: px

Start display at page:

Download "Read mapping with BWA and BOWTIE"

Annice Beasley
5 years ago
Views:

1 Read mapping with BWA and BOWTIE Before We Start In order to save a lot of typing, and to allow us some flexibility in designing these courses, we will establish a UNIX shell variable BASE to point to the current filesystem location of the TACC NGS tutorial material. For any shell you open that accesses Lonestar during today's tutorial, please enter the following command: export BASE=/corral-repl/utexas/BioITeam/tacc_ngs Read Mapping Tutorial Your objective today is to map the paired-end sequence data in these files to a reference sequence, then convert the results to a sorted, indexed BAM file. You can use BWA or Bowtie to do the alignment, and SAMtools facilitate the other tasks. In the latter half of today's work, you will use these results files to call some variants in the human genome. Data Files $BASE/human_variation/allseqs_R1.fastq $BASE/human_variation/allseqs_R2.fastq The Reference Sequence $BASE/human_variation/ref/hs37d5.fa BWA and Bowtie (as well as most other throughput-oriented aligners) require that you provide not just the reference sequence, but an index of that sequence as well. These can be generated by bwa index or bowtie-build, respectively. To save time, we have provided indices formatted for BWA and Bowtie. Look in the $BASE/human_variation/ref to see additional index files besides hs37d5.fa login1$ ls $BASE/human_variation/ref/ hs37d5.fa hs37d5.fa.3.ebwt hs37d5.fa.ann hs37d5.fa.rev.1.ebwt hs37d5.fa.1.ebwt hs37d5.fa.4.ebwt hs37d5.fa.bwt hs37d5.fa.rev.2.ebwt hs37d5.fa.2.ebwt hs37d5.fa.amb hs37d5.fa.pac hs37d5.fa.sa These additional files are generated by the index commands. You can index your own reference sequence very easily, but be warned it may take a few hours so be sure to do it using a batch script. See if you can find some other modules on Lonestar that pertain to Alignment

2 login1$ module key Alignment ---- The following modules match your search criteria: "Alignment" ---- beast: beast/1.7, beast/1.7.2 Tool for Bayesian MCMC analysis of molecular sequences bismark: bismark/0.7.2, bismark/0.7.4 bismark - A tool to map bisulfite converted sequence reads and determine cytosine blast: blast/ NCBI BLAST+ sequence alignment package. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. bowtie: bowtie/0.12.8, bowtie/2.0.0b6 Ultrafast, memory-efficient short read aligner bwa: bwa/0.6.1, bwa/0.6.2 bwa - Burrows-Wheeler Alignment Tool clustalw2: clustalw2/2.1 Tool for multiple sequence alignment gsnap: gsnap/ , gsnap/ gsnap - Genomic Short-read Nucleotide Alignment Program mafft: mafft/6.864, mafft/6.903 Multiple alignment program for amino acid or nucleotide sequences mummer: mummer/3.23 MUMmer - A modular system for the rapid whole genome alignment of finished or draft sequence plink: plink/1.07 plink - Whole genome association analysis toolset tophat: tophat/1.4.1, tophat/2.0.4 TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. Obviously, the word Alignment pops up the secription of a few non-ngs modules, but you can see the utility in modules' keyword search capabilities. Submit some alignment jobs to Lonestar

3 First, let's create a working directory: mkdir $WORK/bwa-align && cd $WORK/bwa-align. Now, we will demonstrate how to submit a BWA alignment job to Lonestar. It's up to you to figure out how to do the same for Bowtie. Using a text editor, create the following file. Feel free to copy and paste or copy it from $BASE/align_bwa_01.sh Hint cp $BASE/align_bwa_01.sh. align_bwa_01.sh #!/bin/bash #$ -V #$ -cwd #$ -pe 1way 12 #$ -q normal #$ -l h_rt=01:00:00 #$ -A NGS-ACES #$ -m be #$ -M vaughn@tacc.utexas.edu #$ -N align_bwa_01 module load bwa/0.6.1 BASE="/corral-repl/utexas/BioITeam/tacc_ngs" time bwa aln $BASE/human_variation/ref/hs37d5.fa $BASE/human_variation /allseqs_r1.fastq > r1.sai time bwa aln $BASE/human_variation/ref/hs37d5.fa $BASE/human_variation /allseqs_r2.fastq > r2.sai time bwa sampe $BASE/human_variation/ref/hs37d5.fa r1.sai r2.sai $BASE /human_variation/allseqs_r1.fastq $BASE/human_variation/allseqs_R2.fastq > hs37d5_allseqs_bwa.sam Once you've edited the job file, submit it to Lonestar's queue, then check the status of the job to make sure it went in. Submit your job file using qsub login1$ qsub align_bwa_01.sh (... A large amount of text will scroll by followed by text resembling this...) Your job ("align_bwa_01") has been submitted Check the status of your job in the queue

4 login1$ qstat job-id prior name user state submit/start at queue slots ja-task-id align_bwa_ vaughn r 10/03/ :00:22 normal@c ls4.tacc.utexa 12 Now, figure out how to align the same sequences using Bowtie Hint The specific commands for Bowtie, which you will need to work into your own job file is module load bowtie/ time bowtie --chunkmbs 256 -x -t -S $BASE/human_variation/ref/hs37d5.fa -1 $BASE/human_variation/allseqs_R1.fastq -2 $BASE/human_variation /allseqs_r2.fastq hs37d5_allseqs_bowtie.sam Notice that Bowtie can handle paired-end mapping and conversion to SAM all in one command! If you run into trouble creating a job file for Bowtie from scratch, you can copy over ours to use as a template. cp /corral-repl/utexas/bioiteam/tacc_ngs/align_bowtie_01.sh. You're running two computing jobs out of the same directory. This will work, but you need to make sure that none of the files created by the two tasks have the same name or both will end up failing or reporting erroneous results. In our case, we've assured this by using two different algorithms (BWA and Bowtie) and by choosing different output names hs37d5_allseqs_bwa.sam and hs37d5_allseqs_b owtie.sam Setting up BAM conversion as a dependency We're going to use SAMTools to convert the results of your alignment jobs from the text version of the SAM format to the more efficient and useful binary BAM format. Along the way we'll sort the reads by chromosome position and create an index of the BAM file. " But Wait! ", you say, " our alignments haven't run yet!". We're going to kill two birds with one stone: We will show you the basics of setting up SAM->sorted BAm conversion and we are also going to demonstrate how to orchestrate a simple workflow using job dependency on Lonestar. This means the downstream conversion tasks won't kick off until the alignment tasks have completed successfully. First, let's assemble the job file we need to convert the BWA output file hs37d5_allseqs_bwa.sam from SAM to BAM. Using a text editor, create the following file. Feel free to copy/paste from the Wiki or copy it from /corral-repl/utexas/bioiteam/tacc_ngs/samtools_bwa_01.sh. Don't submit samtools_bwa_01.sh to the Lonestar queue yet until you've been shown how to establish a job dependency, or the task will fail

5 samtools_bwa_01.sh #!/bin/bash #$ -V #$ -cwd #$ -pe 1way 12 #$ -q normal #$ -l h_rt=02:00:00 #$ -A NGS-ACES #$ -m be #$ -M #$ -N samtools_bwa_01 # Load the samtools module module load samtools # Set up a variable so we don't have to # keep typing in all this path BASE="/corral-repl/utexas/BioITeam/tacc_ngs" # For readability, these are listed one command per line. If you # want to chain them all together into a set of linearly dependent # commands, check out the samtools_bwa_02.sh file for an example # of using the && operator to do just that # # samtools has a suite of sub-commands documented here # # # 'view' is used for filtering as well as simple conversion samtools view -S -b hs37d5_allseqs_bwa.sam > hs37d5_allseqs_bwa.bam # 'sort' lets you sort by position or read name samtools sort hs37d5_allseqs_bwa.bam hs37d5_allseqs_bwa.sorted # 'index' lets you create an index of a BAM file samtools index hs37d5_allseqs_bwa.sorted.bam Setting up a job dependency Check on the status of your bwa job (using qstat). It's probably still running. If it is, and you do an 'ls' on the output directory, either the 'hs37d5_allseqs_bwa.sam' file will be absent or will still be in process of being written. So, if we start running the SAMtools from above assuming that file exists and is complete, we're going to have a situation. We want to tell Lonestar "Don't start running the samtools tasks until AFTER bwa has completed". To do this, you just need to know the job-id of the BWA task. Find out the job-id of your BWA task

6 login1$ qstat job-id prior name user state submit/start at queue slots ja-task-id align_bwa_ vaughn r 10/03/ :00:22 normal@c ls4.tacc.utexa 12 The job-id of MY bwa task is What's yours? Establishing a job dependency is easy. Instead of just typing: qsub samtools_bwa_01.sh to tell Lonestar to start running SAMTools ASAP, you tell it this instead: qsub -hold_jid samtools_bwa_01.sh where you would replace with the job-id of your BWA task. Challenge Mode Hints Create a script to process your Bowtie results using SAMtools Submit it as a dependency to your Bowtie alignment task See samtools_bowtie_01.sh in the tacc_ngs directory if you get stuck.

Calling variants in diploid or multiploid genomes

Calling variants in diploid or multiploid genomes Diploid genomes The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen: 1. 2.